Unified Scene Representation and Reconstruction for 3D Large Language Models
Tao Chu, Pan Zhang, Xiaoyi Dong, Yuhang Zang, Qiong Liu, Jiaqi Wang

TL;DR
This paper introduces Uni3DR^2, a unified 3D scene representation framework that enhances scene reconstruction and understanding for Large Language Models by integrating geometric and semantic features from pre-trained 2D models.
Contribution
We propose Uni3DR^2, a novel unified scene representation method that combines geometric and semantic features from pre-trained models for improved 3D reconstruction and LLM understanding.
Findings
Achieves +1.8% F-Score improvement on ScanNet reconstruction.
Increases BLEU-1 by +4.0% and +4.2% on ScanQA for LLM tasks.
Outperforms state-of-the-art methods that use ground truth point clouds.
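For context on the first finding, the F-Score commonly reported for 3D reconstruction is the harmonic mean of precision and recall between predicted and ground-truth points under a distance threshold. The sketch below illustrates that general metric only; it is not the paper's evaluation code. The function name `reconstruction_f_score`, the brute-force nearest-neighbour search, and the default threshold of 0.05 are assumptions made for illustration.

```python
import numpy as np


def reconstruction_f_score(pred, gt, threshold=0.05):
    """F-score between point sets of shape (N, 3) and (M, 3).

    `threshold` is an assumed distance cutoff, in the same units as the points.
    """
    pred = np.asarray(pred, dtype=float)
    gt = np.asarray(gt, dtype=float)
    # Pairwise Euclidean distances, shape (N, M). Brute force: small sets only.
    dists = np.linalg.norm(pred[:, None, :] - gt[None, :, :], axis=-1)
    # Precision: fraction of predicted points close to some ground-truth point.
    precision = float(np.mean(dists.min(axis=1) < threshold))
    # Recall: fraction of ground-truth points close to some predicted point.
    recall = float(np.mean(dists.min(axis=0) < threshold))
    if precision + recall == 0.0:
        return 0.0
    # Harmonic mean of precision and recall.
    return 2.0 * precision * recall / (precision + recall)
```

For real scenes with many points, a spatial index such as a k-d tree would replace the dense distance matrix; the dense version is kept here for readability.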
Abstract
Enabling Large Language Models (LLMs) to interact with 3D environments is challenging. Existing approaches extract point clouds either from ground truth (GT) geometry or 3D scenes reconstructed by auxiliary models. Text-image aligned 2D features from CLIP are then lifted to point clouds, which serve as inputs for LLMs. However, this solution lacks the establishment of 3D point-to-point connections, leading to a deficiency of spatial structure information. Concurrently, the absence of integration and unification between the geometric and semantic representations of the scene culminates in a diminished level of 3D scene understanding. In this paper, we demonstrate the importance of having a unified scene representation and reconstruction framework, which is essential for LLMs in 3D scenes. Specifically, we introduce Uni3DR^2, which extracts 3D geometric and semantic aware representation features…
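The "lifting" step that the abstract attributes to existing approaches can be pictured as projecting each 3D point into an image and copying the 2D feature found at that pixel. The sketch below is a minimal illustration under stated assumptions, not the method of this paper or of any specific prior work: the function name `lift_features_to_points`, the pinhole camera model, nearest-pixel sampling, and the absence of occlusion handling are all choices made here for brevity.

```python
import numpy as np


def lift_features_to_points(points_world, feat_map, K, world_to_cam):
    """Assign each 3D point the 2D feature at the pixel it projects to.

    points_world: (N, 3) points in world coordinates.
    feat_map:     (H, W, C) per-pixel features, e.g. from a 2D encoder.
    K:            (3, 3) pinhole intrinsics (assumed camera model).
    world_to_cam: (4, 4) rigid world-to-camera transform.
    Returns (features of shape (N, C), boolean visibility mask of shape (N,)).
    """
    H, W, C = feat_map.shape
    n = points_world.shape[0]
    # Move points into the camera frame using homogeneous coordinates.
    homo = np.concatenate([points_world, np.ones((n, 1))], axis=1)
    cam = (world_to_cam @ homo.T).T[:, :3]
    z = cam[:, 2]
    in_front = z > 1e-6
    # Avoid dividing by zero or negative depth for points behind the camera.
    safe_z = np.where(in_front, z, 1.0)
    pix = (K @ cam.T).T
    u = np.round(pix[:, 0] / safe_z).astype(int)  # column index
    v = np.round(pix[:, 1] / safe_z).astype(int)  # row index
    visible = in_front & (u >= 0) & (u < W) & (v >= 0) & (v < H)
    # Points outside the image keep a zero feature. Occlusion is ignored.
    feats = np.zeros((n, C), dtype=feat_map.dtype)
    feats[visible] = feat_map[v[visible], u[visible]]
    return feats, visible
```

In this sketch every point receives its feature independently of its neighbours, which is one way to read the abstract's remark that such pipelines establish no 3D point-to-point connections.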
Taxonomy
Topics: Natural Language Processing Techniques · Multimodal Machine Learning Applications · Topic Modeling
Methods: Sparse Evolutionary Training · Attentive Walk-Aggregating Graph Neural Network · Contrastive Language-Image Pre-training
