LLaVA-3D: A Simple yet Effective Pathway to Empowering LMMs with 3D-awareness
Chenming Zhu, Tai Wang, Wenwei Zhang, Jiangmiao Pang, Xihui Liu

TL;DR
LLaVA-3D is a unified framework that adds 3D scene understanding to large multimodal models by integrating 3D spatial context. It enables accurate 3D perception, preserves 2D capabilities, and improves training efficiency.
Contribution
The paper presents a simple, effective method for adapting LLaVA to 3D understanding through 3D position embeddings and joint 2D and 3D training, achieving state-of-the-art results and converging 3.5x faster than previous 3D LMMs.
Findings
Supports direct decoding of 3D spatial outputs like bounding boxes
Converges 3.5x faster than previous 3D LMMs
Maintains comparable 2D understanding and conversation abilities
Abstract
Recent advancements in Large Multimodal Models (LMMs) have greatly enhanced their proficiency in 2D visual understanding tasks, enabling them to effectively process and understand images and videos. However, the development of LMMs with 3D scene understanding capabilities has been hindered by the lack of large-scale 3D vision-language datasets and powerful 3D encoders. In this paper, we introduce a simple yet effective framework called LLaVA-3D. Leveraging the strong 2D visual understanding priors from LLaVA, our LLaVA-3D efficiently adapts LLaVA for 3D scene understanding without compromising 2D understanding capabilities. To achieve this, we utilize the 3D position embeddings to enhance the 2D CLIP Patches with 3D spatial context information and construct 3D patches. By integrating the 3D position embeddings into 2D LMMs and employing joint 2D and 3D vision-language instruction…
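The idea of enriching 2D patch features with 3D position embeddings can be illustrated with a minimal NumPy sketch. It is not the authors' implementation. The abstract does not say how 3D positions are obtained or encoded, so the sketch assumes a per-patch depth map, pinhole intrinsics K expressed in patch-grid units, a camera-to-world pose, and a two-layer MLP as the position encoder. All function and parameter names here are hypothetical.

```python
import numpy as np


def backproject_patches(depth, K, cam_to_world):
    """Lift each patch centre to a world-frame XYZ point.

    depth:        (h, w) array, one depth value per patch (assumed input).
    K:            (3, 3) pinhole intrinsics in patch-grid units (assumed).
    cam_to_world: (4, 4) homogeneous camera-to-world pose (assumed).
    Returns an (h * w, 3) array of 3D points in row-major patch order.
    """
    h, w = depth.shape
    # Patch-centre coordinates: v indexes rows, u indexes columns.
    v, u = np.meshgrid(np.arange(h) + 0.5, np.arange(w) + 0.5, indexing="ij")
    # Standard pinhole back-projection into the camera frame.
    x = (u - K[0, 2]) / K[0, 0] * depth
    y = (v - K[1, 2]) / K[1, 1] * depth
    pts_cam = np.stack([x, y, depth, np.ones_like(depth)], axis=-1)  # (h, w, 4)
    # Row-vector form of T @ p for every patch.
    pts_world = pts_cam @ cam_to_world.T
    return pts_world[..., :3].reshape(-1, 3)


def position_embedding(xyz, w1, b1, w2, b2):
    """Hypothetical two-layer MLP mapping XYZ to the patch feature width."""
    hidden = np.maximum(xyz @ w1 + b1, 0.0)  # ReLU
    return hidden @ w2 + b2


def make_3d_patches(patch_feats, depth, K, cam_to_world, params):
    """Add a 3D position embedding to each 2D patch feature.

    patch_feats: (h * w, dim) features from a 2D encoder such as CLIP.
    params:      tuple (w1, b1, w2, b2) for position_embedding.
    """
    xyz = backproject_patches(depth, K, cam_to_world)
    return patch_feats + position_embedding(xyz, *params)
```

Under these assumptions, the output of make_3d_patches has the same shape as the 2D patch features, so it could be passed to a language model in place of them. The position encoder is the only new component the sketch introduces.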
Taxonomy
Topics: Collaboration in agile enterprises · BIM and Construction Integration
Methods: Contrastive Language-Image Pre-training
