Language-Image Models with 3D Understanding
Jang Hyun Cho, Boris Ivanovic, Yulong Cao, Edward Schmerling, Yue Wang, Xinshuo Weng, Boyi Li, Yurong You, Philipp Krähenbühl, Yan Wang, Marco Pavone

TL;DR
This paper introduces Cube-LLM, a large-scale multi-modal language model trained on a new 3D-aware dataset, demonstrating significant improvements in 3D perception and reasoning tasks through data scaling and prompting techniques.
Contribution
The paper presents a novel 3D-aware multi-modal language model, Cube-LLM, trained on the LV3D dataset, showing that data scaling alone enhances 3D perception without specialized architecture.
Findings
Cube-LLM outperforms baselines by 21.3 AP-BEV on Talk2Car.
Cube-LLM outperforms baselines by 17.7 points on DriveLM for complex reasoning about driving scenarios.
Cube-LLM attains an average score of 87.0 on refCOCO for 2D grounding.
Abstract
Multi-modal large language models (MLLMs) have shown incredible capabilities in a variety of 2D vision and language tasks. We extend MLLMs' perceptual capabilities to ground and reason about images in 3-dimensional space. To that end, we first develop a large-scale pre-training dataset for 2D and 3D called LV3D by combining multiple existing 2D and 3D recognition datasets under a common task formulation: as multi-turn question-answering. Next, we introduce a new MLLM named Cube-LLM and pre-train it on LV3D. We show that pure data scaling makes a strong 3D perception capability without 3D specific architectural design or training objective. Cube-LLM exhibits intriguing properties similar to LLMs: (1) Cube-LLM can apply chain-of-thought prompting to improve 3D understanding from 2D context information. (2) Cube-LLM can follow complex and diverse instructions and adapt to versatile input…
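The "common task formulation" described in the abstract can be pictured with a small sketch. The following is only an illustration, assuming a hypothetical label schema (`name`, `box2d`, `box3d`) and hypothetical prompt wording; it is not the LV3D format or the authors' code. It shows how a single object annotation might be cast as a multi-turn dialogue in which the 2D question comes first, so that the 2D answer is available as context when the 3D question is asked.

```python
from typing import Dict, List


def fmt(values: List[float]) -> str:
    """Render numbers as a compact bracketed string with two decimals."""
    return "[" + ", ".join(f"{v:.2f}" for v in values) + "]"


def to_multiturn_qa(label: Dict) -> List[Dict[str, str]]:
    """Turn one object annotation into an easy-to-hard QA dialogue.

    Hypothetical schema (an assumption, not the paper's):
      name  : object category or description
      box2d : [x1, y1, x2, y2], normalized image coordinates
      box3d : [x, y, z, w, h, l, yaw], camera coordinates (optional)
    """
    turns = [
        {
            # Turn 1: 2D grounding, the easier question.
            "question": f"Provide the 2D bounding box of the {label['name']}.",
            "answer": fmt(label["box2d"]),
        }
    ]
    if "box3d" in label:
        # Turn 2: 3D grounding, asked after the 2D answer is in the dialogue.
        turns.append(
            {
                "question": f"Provide the 3D bounding box of the {label['name']}.",
                "answer": fmt(label["box3d"]),
            }
        )
    return turns


example = {
    "name": "car",
    "box2d": [0.1, 0.25, 0.4, 0.6],
    "box3d": [1.5, 0.2, 12.0, 1.8, 1.5, 4.2, 0.1],
}
dialogue = to_multiturn_qa(example)
for turn in dialogue:
    print("Q:", turn["question"])
    print("A:", turn["answer"])
```

Labels from 2D-only datasets simply produce a one-turn dialogue under this sketch, which is one way 2D and 3D sources could share a single training format.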
Peer Reviews
Decision: ICLR 2025 Poster
1. The overall data pipeline is well-structured, extending LLaVA to a 3D data format with large-scale pretraining. It incorporates standardization of 2D/3D data labels (as shown in Fig. 3), unification of model inputs and outputs, and the implementation of visual chain-of-thought (CoT) reasoning for step-by-step analysis.
2. The evaluation is comprehensive, covering various tasks such as 3D grounding (Talk2Car, DriveLM-grounding, Indoor Objectron, ARKitScenes, SUN-RGBD), QA (DriveLM-QA), 3D grounding a
1. The paper lacks an in-depth analysis of joint 2D and 3D training. It closely follows LLaVA-1.5, with DINOv2 as the vision model, and primarily contributes by consolidating 2D and 3D datasets. While 3D performance improvements could add value, the impact seems insufficient for acceptance. I would like more analysis of how excluding 2D and 3D box pretraining affects 3D/2D QA and grounding performance.
2. In my view, the Visual CoT would benefit from zooming in on selected parts (SoM [1]) to en
1. The authors proposed a new MLLM for 3D understanding, *i.e.*, Cube-LLM. To enable joint learning from both 2D and 3D domain knowledge, the authors proposed datasets with unified 2D and 3D formats, as well as training tasks on different data modalities.
2. Experimental results show the benefit of the proposed data+task training paradigm, with improved performance on 2D and 3D visual grounding, as well as on tasks that require some reasoning.
3. The proposed framework will enable future research
1. The experiments on DriveLM QA compare against weak baselines, and the reasoning examples on DriveLM rely on fine-tuning. For instance, how do the reasoning capabilities of the proposed Cube-LLM compare to methods trained with spatial reasoning data, such as SpatialRGPT and SpatialVLM? I think the reasoning capabilities of the proposed method are an interesting topic to study, but the current results are not very promising.
2. Table 4 shows that Cube-LLM outperforms previous specialist and gener
1. The exploration of LLM-based 3D reasoning is timely and relevant, given the nascent stage of 3D language-vision models compared to their 2D counterparts. One significant contribution lies in the authors’ pipeline for collecting datasets suited for training multimodal LLMs, which addresses the challenge of limited large-scale datasets for 3D tasks.
2. Experiments on two established 3D reasoning benchmarks, Talk2Car and DriveLM, show that the proposed method achieves superior performance, indi
1. The collected dataset primarily focuses on 3D captioning and grounding tasks, which restricts the breadth of 3D capabilities the model can support. Expanding the dataset to include other essential 3D reasoning capabilities, such as depth ordering, neighborhood relationships, object sizing, and directionality, would better align with real-world 3D applications.
2. While the experiments indicate that the proposed method performs effectively for 3D visual grounding, it remains unclear if the mo
Taxonomy
Topics: Multimodal Machine Learning Applications · Natural Language Processing Techniques
Methods: Sparse Evolutionary Training
