LLaVA-3D: A Simple yet Effective Pathway to Empowering LMMs with 3D-awareness

Chenming Zhu; Tai Wang; Wenwei Zhang; Jiangmiao Pang; Xihui Liu

arXiv:2409.18125·cs.CV·April 29, 2025·2 cites

TL;DR

LLaVA-3D is a unified framework that adds 3D scene understanding to large multimodal models by integrating 3D spatial context. It enables accurate 3D perception without sacrificing 2D capabilities, and it also improves training efficiency.

Contribution

The paper presents a simple, effective method to adapt LLaVA for 3D understanding using 3D position embeddings and joint training, achieving state-of-the-art results and faster convergence.

Findings

01 Supports direct decoding of 3D spatial outputs like bounding boxes

02 Converges 3.5x faster than previous 3D LMMs

03 Maintains comparable 2D understanding and conversation abilities

Abstract

Recent advancements in Large Multimodal Models (LMMs) have greatly enhanced their proficiency in 2D visual understanding tasks, enabling them to effectively process and understand images and videos. However, the development of LMMs with 3D scene understanding capabilities has been hindered by the lack of large-scale 3D vision-language datasets and powerful 3D encoders. In this paper, we introduce a simple yet effective framework called LLaVA-3D. Leveraging the strong 2D visual understanding priors from LLaVA, LLaVA-3D efficiently adapts LLaVA for 3D scene understanding without compromising 2D understanding capabilities. To achieve this, we use 3D position embeddings to enhance the 2D CLIP patches with 3D spatial context information and construct 3D patches. By integrating the 3D position embeddings into 2D LMMs and employing joint 2D and 3D vision-language instruction…
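The idea of turning 2D patches into 3D patches can be illustrated with a minimal sketch. It is not the authors' implementation: the pinhole back-projection in the camera frame, the fixed sinusoidal encoding, and every function name below (`backproject`, `sinusoidal_3d_embedding`, `make_3d_patches`) are assumptions chosen for illustration. The abstract only states that 3D position embeddings are combined with 2D CLIP patch features.

```python
import math
from typing import List, Sequence, Tuple

Vec = List[float]


def backproject(u: float, v: float, depth: float,
                fx: float, fy: float, cx: float, cy: float) -> Tuple[float, float, float]:
    """Lift pixel (u, v) with metric depth to a camera-frame 3D point (pinhole model)."""
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    return (x, y, depth)


def sinusoidal_3d_embedding(point: Sequence[float], dim: int) -> Vec:
    """Encode (x, y, z) as a dim-length vector of sin/cos pairs.

    Assumption: a fixed sinusoidal encoding stands in for whatever
    (possibly learned) embedding a real model would use.
    """
    if dim % 6 != 0:
        raise ValueError("dim must be divisible by 6 (3 axes x sin/cos pairs)")
    n_freq = dim // 6
    emb: Vec = []
    for coord in point:
        for k in range(n_freq):
            freq = 1.0 / (10000.0 ** (k / n_freq))
            emb.append(math.sin(coord * freq))
            emb.append(math.cos(coord * freq))
    return emb


def make_3d_patches(patch_feats: Sequence[Vec],
                    patch_centers: Sequence[Tuple[float, float]],
                    patch_depths: Sequence[float],
                    intrinsics: Tuple[float, float, float, float]) -> List[Vec]:
    """Add a 3D position embedding to each 2D patch feature vector."""
    fx, fy, cx, cy = intrinsics
    out: List[Vec] = []
    for feat, (u, v), d in zip(patch_feats, patch_centers, patch_depths):
        point = backproject(u, v, d, fx, fy, cx, cy)
        pos = sinusoidal_3d_embedding(point, len(feat))
        out.append([f + p for f, p in zip(feat, pos)])
    return out


# Hypothetical toy input: two 6-dimensional "patch features" from one image.
feats = [[0.0] * 6, [1.0] * 6]
centers = [(320.0, 240.0), (420.0, 240.0)]   # patch centers in pixels
depths = [0.0, 5.0]                          # depth at each patch center
patches_3d = make_3d_patches(feats, centers, depths, (500.0, 500.0, 320.0, 240.0))
```

Because the position term is added to each feature rather than concatenated, the token dimension is unchanged. Under these assumptions, a downstream 2D LMM could consume the resulting vectors without architectural changes.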

Code & Models

Models

🤗 ChaimZhu/LLaVA-3D-7B · model · 780 downloads · 6 likes

Taxonomy

Topics: Collaboration in agile enterprises · BIM and Construction Integration

Methods: Contrastive Language-Image Pre-training