Direction-aware 3D Large Multimodal Models
Quan Liu, Weihao Xuan, Junjue Wang, Naoto Yokoya, Ling Shao, Shijian Lu

TL;DR
This paper introduces a new paradigm for direction-aware 3D large multimodal models by automatically recovering and aligning ego poses in point cloud data, significantly improving model performance on spatial reasoning tasks.
Contribution
The work proposes PoseRecover and PoseAlign, two novel methods for automatically recovering and aligning ego poses in point cloud benchmarks, enabling direction-aware 3D multimodal modeling.
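As a rough illustration of what aligning point cloud data to an ego pose can look like, the sketch below re-expresses world-frame points in the camera (ego) frame using a 4x4 extrinsic matrix. The function name `align_points_to_ego` and the camera-to-world convention are assumptions made for this example, not details taken from the paper.

```python
import numpy as np


def align_points_to_ego(points_world, cam_to_world):
    """Express world-frame points in the ego (camera) frame.

    points_world: (N, 3) array of XYZ coordinates in the world frame.
    cam_to_world: (4, 4) rigid transform; the camera-to-world convention
                  is an assumption of this sketch.
    """
    # Invert the pose so it maps world coordinates into the camera frame.
    world_to_cam = np.linalg.inv(cam_to_world)
    # Append a homogeneous coordinate of 1 to every point.
    ones = np.ones((len(points_world), 1))
    homogeneous = np.hstack([points_world, ones])
    # Apply the transform row-wise and drop the homogeneous coordinate.
    return (homogeneous @ world_to_cam.T)[:, :3]
```

With the points expressed this way, directional words such as "left" or "behind" can be read off the coordinate axes of the ego frame, under whatever axis convention the extrinsics use.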
Findings
Improved ScanRefer mIoU by 30.0%
Enhanced Scan2Cap LLM-as-judge accuracy by 11.7%
Consistent performance gains across multiple 3D LMM backbones
Abstract
3D large multimodal models (3D LMMs) rely heavily on ego poses to enable directional question-answering and spatial reasoning. However, most existing point cloud benchmarks contain rich directional queries but lack the corresponding ego poses, making them inherently ill-posed in 3D large multimodal modelling. In this work, we define a new, rigorous paradigm that enables direction-aware 3D LMMs by identifying ego poses, supplementing point cloud benchmarks with them, and transforming the corresponding point cloud data according to the identified poses. We enable direction-aware 3D LMMs with two novel designs. The first is PoseRecover, a fully automatic pose recovery pipeline that matches questions with ego poses from RGB-D video extrinsics via object-frustum intersection and a visibility check with Z-buffers. The second is PoseAlign, which transforms the point cloud data to be…
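To make the frustum-intersection and Z-buffer ideas concrete, here is a minimal sketch of how one candidate camera pose might be scored against a referenced object. It assumes a pinhole camera with intrinsics `K`, points already expressed in the camera frame, and a point-based Z-buffer. The helper names, the near/far planes, and the depth tolerance `tol` are hypothetical choices for illustration, not the paper's actual pipeline.

```python
import numpy as np


def project(points_cam, K):
    """Pinhole projection: returns pixel coordinates (N, 2) and depths (N,)."""
    z = points_cam[:, 2]
    safe_z = np.where(z > 0, z, 1.0)  # avoid dividing by zero or negative depth
    uv = (points_cam @ K.T)[:, :2] / safe_z[:, None]
    return uv, z


def in_frustum(uv, z, width, height, near=0.1, far=10.0):
    """Mask of points between the near/far planes and inside the image bounds."""
    return ((z > near) & (z < far)
            & (uv[:, 0] >= 0) & (uv[:, 0] < width)
            & (uv[:, 1] >= 0) & (uv[:, 1] < height))


def build_zbuffer(uv, z, mask, width, height):
    """Keep the nearest depth that lands on each pixel."""
    zbuf = np.full((height, width), np.inf)
    cols = uv[mask, 0].astype(int)
    rows = uv[mask, 1].astype(int)
    np.minimum.at(zbuf, (rows, cols), z[mask])
    return zbuf


def visible_fraction(scene_cam, object_cam, K, width, height, tol=0.05):
    """Fraction of object points inside the frustum and not occluded."""
    s_uv, s_z = project(scene_cam, K)
    s_in = in_frustum(s_uv, s_z, width, height)
    zbuf = build_zbuffer(s_uv, s_z, s_in, width, height)

    o_uv, o_z = project(object_cam, K)
    o_in = in_frustum(o_uv, o_z, width, height)
    if not o_in.any():
        return 0.0
    cols = o_uv[o_in, 0].astype(int)
    rows = o_uv[o_in, 1].astype(int)
    # A point counts as visible if nothing in the scene is clearly in front of it.
    unoccluded = o_z[o_in] <= zbuf[rows, cols] + tol
    return float(unoccluded.sum()) / len(object_cam)
```

One way to use such a score would be to evaluate every frame's extrinsics for the object a question mentions and keep the pose with the highest visible fraction. That selection rule is likewise an assumption of this sketch.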
Taxonomy
Topics: Multimodal Machine Learning Applications · 3D Shape Modeling and Analysis · Robotics and Sensor-Based Localization
