Grounded 3D-LLM with Referent Tokens
Yilun Chen, Shuai Yang, Haifeng Huang, Tai Wang, Runsen Xu, Ruiyuan Lyu, Dahua Lin, Jiangmiao Pang

TL;DR
This paper introduces Grounded 3D-LLM, a unified generative model for 3D scene understanding that uses referent tokens and large-scale datasets to perform diverse vision tasks with leading performance.
Contribution
It presents a novel 3D large multi-modal model that unifies various 3D vision tasks within a generative framework using referent tokens and contrastive pre-training.
Findings
Achieves state-of-the-art results on multiple 3D benchmarks.
Effectively handles both open-ended and close-ended 3D vision tasks.
Demonstrates broad applicability across diverse 3D scene understanding tasks.
Abstract
Prior studies on 3D scene understanding have primarily developed specialized models for specific tasks or required task-specific fine-tuning. In this study, we propose Grounded 3D-LLM, which explores the potential of 3D large multi-modal models (3D LMMs) to consolidate various 3D vision tasks within a unified generative framework. The model uses scene referent tokens as special noun phrases to reference 3D scenes, enabling it to handle sequences that interleave 3D and textual data. Per-task instruction-following templates are employed to ensure naturalness and diversity in translating 3D vision tasks into language formats. To facilitate the use of referent tokens in subsequent language modeling, we provide a large-scale, automatically curated grounded scene-text dataset with over 1 million phrase-to-region correspondences and introduce Contrastive Language-Scene Pre-training (CLASP) to…
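To make the idea of interleaving referent tokens with text concrete, here is a minimal sketch in Python. It is an illustration under stated assumptions, not the paper's implementation: the token string `<ref>`, the `PhraseLink` record, and the `interleave_referents` helper are all hypothetical names, and the real model's tokenization and region representation may differ. The sketch only shows how phrase-to-region correspondences could be turned into a text sequence in which each grounded noun phrase is followed by a placeholder token tied to a 3D region.

```python
from dataclasses import dataclass

REF_TOKEN = "<ref>"  # hypothetical placeholder string for a referent token


@dataclass
class PhraseLink:
    phrase: str     # noun phrase as it appears in the text
    region_id: int  # index of the 3D region the phrase refers to (assumed integer id)


def interleave_referents(text, links):
    """Insert REF_TOKEN after the first occurrence of each linked phrase.

    Returns the rewritten text and the region ids in the order in which
    their referent tokens appear in that text.
    """
    insert_points = []
    for link in links:
        start = text.find(link.phrase)
        if start == -1:
            raise ValueError(f"phrase not found in text: {link.phrase!r}")
        insert_points.append((start + len(link.phrase), link.region_id))

    # Process insertions left to right so token order matches reading order.
    insert_points.sort()

    pieces = []
    region_ids = []
    cursor = 0
    for end, region_id in insert_points:
        pieces.append(text[cursor:end])
        pieces.append(" " + REF_TOKEN)
        region_ids.append(region_id)
        cursor = end
    pieces.append(text[cursor:])
    return "".join(pieces), region_ids


if __name__ == "__main__":
    caption = "A chair is next to the table."
    links = [PhraseLink("the table", 7), PhraseLink("A chair", 3)]
    grounded_text, region_ids = interleave_referents(caption, links)
    print(grounded_text)
    print(region_ids)
```

In a sketch like this, the returned list of region ids is what would let a downstream component associate the i-th placeholder token with the i-th 3D region; how the actual model represents and decodes those regions is not specified in the excerpt above.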
Taxonomy
Topics: Image Processing and 3D Reconstruction · Tunneling and Rock Mechanics · Natural Language Processing Techniques
