Direction-aware 3D Large Multimodal Models

Quan Liu; Weihao Xuan; Junjue Wang; Naoto Yokoya; Ling Shao; Shijian Lu

arXiv:2602.19063·cs.CV·February 24, 2026

TL;DR

This paper makes 3D large multimodal models direction-aware by automatically recovering the ego poses missing from point cloud benchmarks and aligning the point cloud data to them, which improves spatial reasoning results (for example, ScanRefer mIoU by 30.0% and Scan2Cap LLM-as-judge accuracy by 11.7%).

Contribution

The work proposes two designs: PoseRecover, which automatically recovers ego poses for point cloud benchmarks, and PoseAlign, which transforms the point cloud data according to the recovered poses. Together they enable direction-aware 3D multimodal modeling.
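As a rough illustration of what transforming a point cloud according to an ego pose can involve, the sketch below re-expresses world-frame points in the coordinate frame of a camera. It assumes the pose is given as a 4x4 camera-to-world rigid transform; the function name `world_to_ego` and this pose convention are assumptions made for illustration, not the paper's PoseAlign implementation.

```python
from typing import List, Sequence

Vec3 = Sequence[float]
Mat4 = Sequence[Sequence[float]]


def world_to_ego(points: Sequence[Vec3], cam_to_world: Mat4) -> List[List[float]]:
    """Express world-frame points in the ego (camera) frame.

    Assumes cam_to_world = [R | t] is a rigid transform, so its inverse
    is [R^T | -R^T t] and each point maps to R^T (p - t).
    """
    rot = [list(row[:3]) for row in cam_to_world[:3]]  # 3x3 rotation R
    trans = [row[3] for row in cam_to_world[:3]]       # translation t
    ego_points = []
    for p in points:
        # Offset of the point from the camera centre, in world axes.
        d = [p[i] - trans[i] for i in range(3)]
        # Multiply by R^T: output axis j is column j of R dotted with d.
        ego_points.append(
            [sum(rot[i][j] * d[i] for i in range(3)) for j in range(3)]
        )
    return ego_points
```

With the scene expressed in the ego frame, directional words such as "left" or "behind" can be read off fixed coordinate axes. Which axis corresponds to which direction depends on the camera convention in use.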

Findings

01 Improved ScanRefer mIoU by 30.0%

02 Enhanced Scan2Cap LLM-as-judge accuracy by 11.7%

03 Consistent performance gains across multiple 3D LMM backbones

Abstract

3D large multimodal models (3D LMMs) rely heavily on ego poses to enable directional question-answering and spatial reasoning. However, most existing point cloud benchmarks contain rich directional queries but lack the corresponding ego poses, which makes them inherently ill-posed for 3D large multimodal modelling. In this work, we define a new and rigorous paradigm that enables direction-aware 3D LMMs by identifying ego poses, supplementing them into point cloud benchmarks, and transforming the corresponding point cloud data according to the identified ego poses. We enable direction-aware 3D LMMs with two novel designs. The first is PoseRecover, a fully automatic pose recovery pipeline that matches questions with ego poses from RGB-D video extrinsics via object-frustum intersection and a visibility check with Z-buffers. The second is PoseAlign, which transforms the point cloud data to be…
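The two geometric tests named in the abstract, object-frustum intersection and a Z-buffer visibility check, can be sketched generically as follows. This is a minimal illustration under stated assumptions, not the PoseRecover pipeline: it assumes a pinhole camera, points already expressed in the camera frame, and a Z-buffer built by splatting scene points onto pixels. The `Camera` type, the function names, and the depth tolerance `tol` are all hypothetical.

```python
import math
from typing import List, NamedTuple, Optional, Sequence, Tuple

Point = Sequence[float]


class Camera(NamedTuple):
    """Hypothetical pinhole intrinsics plus image size and depth range."""
    fx: float
    fy: float
    cx: float
    cy: float
    width: int
    height: int
    near: float = 0.1
    far: float = 10.0


def pixel_in_frustum(p_cam: Point, cam: Camera) -> Optional[Tuple[int, int, float]]:
    """Return (col, row, depth) if the point lies inside the view frustum, else None."""
    x, y, z = p_cam
    if not (cam.near <= z <= cam.far):
        return None  # behind the camera or outside the depth range
    u = cam.fx * x / z + cam.cx
    v = cam.fy * y / z + cam.cy
    if 0 <= u < cam.width and 0 <= v < cam.height:
        return int(u), int(v), z
    return None


def build_zbuffer(scene_cam: Sequence[Point], cam: Camera) -> List[List[float]]:
    """Keep the nearest depth that projects onto each pixel."""
    zbuf = [[math.inf] * cam.width for _ in range(cam.height)]
    for p in scene_cam:
        hit = pixel_in_frustum(p, cam)
        if hit is not None:
            col, row, z = hit
            zbuf[row][col] = min(zbuf[row][col], z)
    return zbuf


def visible_fraction(object_cam: Sequence[Point], zbuf: List[List[float]],
                     cam: Camera, tol: float = 0.05) -> float:
    """Fraction of object points that are in the frustum and not occluded."""
    if not object_cam:
        return 0.0
    visible = 0
    for p in object_cam:
        hit = pixel_in_frustum(p, cam)
        if hit is not None:
            col, row, z = hit
            # Visible if no scene point at this pixel is clearly closer.
            if z <= zbuf[row][col] + tol:
                visible += 1
    return visible / len(object_cam)
```

One way to use such a score would be to treat a video frame as a candidate ego pose for a question when the referenced object's visible fraction exceeds some threshold. Both the threshold and that selection rule are assumptions here, since the source does not specify them.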

Taxonomy

Topics: Multimodal Machine Learning Applications · 3D Shape Modeling and Analysis · Robotics and Sensor-Based Localization