Chat-3D: Data-efficiently Tuning Large Language Model for Universal Dialogue of 3D Scenes
Zehan Wang, Haifeng Huang, Yang Zhao, Ziang Zhang, Zhou Zhao

TL;DR
Chat-3D introduces a universal dialogue system for 3D scenes by aligning pre-trained 3D representations with the feature space of a large language model, enabling the model to perceive, reason about, and converse about 3D environments.
Contribution
The paper presents a novel method for aligning 3D scene representations with LLMs, creating the first universal dialogue system capable of understanding and reasoning about 3D scenes.
Findings
Achieves a 75.6% relative score compared with GPT-4 on the constructed 3D instruction dataset.
Demonstrates strong spatial reasoning and instruction comprehension in 3D environments.
Constructs a high-quality object-centric 3D instruction dataset.
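A "relative score" such as the one reported above can be read as a ratio of judge ratings: the model's total rating divided by the reference system's total rating on the same questions. The sketch below assumes that reading, which this summary does not spell out. The function name and the rating values are hypothetical placeholders, not results from the paper.

```python
def relative_score(model_ratings, reference_ratings):
    """Return the model's total rating as a percentage of the reference total.

    Assumption: both lists hold per-question ratings from the same judge,
    paired by position.
    """
    if len(model_ratings) != len(reference_ratings):
        raise ValueError("ratings must be paired per question")
    reference_total = sum(reference_ratings)
    if reference_total == 0:
        raise ValueError("reference ratings sum to zero")
    return 100.0 * sum(model_ratings) / reference_total


# Placeholder ratings on a 1-10 scale, invented for illustration only.
model = [6, 7, 8]
reference = [8, 9, 10]
score = relative_score(model, reference)
print(f"relative score: {score:.1f}%")
```

Under this reading, a value of 75.6% would mean the model's answers collected about three quarters of the rating mass that the reference answers collected.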
Abstract
3D scene understanding has gained significant attention due to its wide range of applications. However, existing methods for 3D scene understanding are limited to specific downstream tasks, which hinders their practicality in real-world applications. This paper presents Chat-3D, which combines the 3D visual perceptual ability of pre-trained 3D representations and the impressive reasoning and conversation capabilities of advanced LLMs to achieve the first universal dialogue system for 3D scenes. Specifically, we align 3D representations into the feature space of LLMs, thus enabling LLMs to perceive the 3D world. Given the scarcity of 3D scene-text data, we propose a three-stage training strategy to efficiently utilize the available data for better alignment. To enhance the reasoning ability and develop a user-friendly interaction scheme, we further construct a high-quality…
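The alignment idea in the abstract, mapping 3D features into the LLM's feature space, can be illustrated with a toy projection. This is a minimal sketch that assumes a single linear layer and made-up dimensions. The names `OBJ_DIM`, `LLM_DIM`, `make_projection`, and `project` are hypothetical, and the paper's actual projector and three-stage training are not described in this summary.

```python
import random

OBJ_DIM = 8    # hypothetical size of one pre-trained 3D object feature
LLM_DIM = 16   # hypothetical size of one LLM token embedding


def make_projection(in_dim, out_dim, seed=0):
    """Create a randomly initialised linear layer (weight rows, bias)."""
    rng = random.Random(seed)
    weight = [[rng.gauss(0.0, 0.02) for _ in range(in_dim)] for _ in range(out_dim)]
    bias = [0.0] * out_dim
    return weight, bias


def project(features, weight, bias):
    """Map each input feature vector to the output dimension: W x + b."""
    return [
        [sum(w * x for w, x in zip(row, feat)) + b for row, b in zip(weight, bias)]
        for feat in features
    ]


# Stand-ins for real data: 3 object features and 5 text token embeddings.
rng = random.Random(1)
object_features = [[rng.random() for _ in range(OBJ_DIM)] for _ in range(3)]
text_embeddings = [[rng.random() for _ in range(LLM_DIM)] for _ in range(5)]

weight, bias = make_projection(OBJ_DIM, LLM_DIM)
scene_tokens = project(object_features, weight, bias)

# After projection, scene tokens share the text embedding width, so they can
# be placed in the same input sequence as the text tokens.
llm_input = scene_tokens + text_embeddings
```

The point of the sketch is only the shape contract: once projected, each 3D feature has the same width as a text embedding, which is what lets a language model consume both in one sequence.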
Taxonomy
Topics: Multimodal Machine Learning Applications · Human Pose and Action Recognition · Hand Gesture Recognition Systems
Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Position-Wise Feed-Forward Layer · Byte Pair Encoding · Adam · Label Smoothing · Layer Normalization · Absolute Position Encodings · Residual Connection
