Grounded 3D-LLM with Referent Tokens

Yilun Chen; Shuai Yang; Haifeng Huang; Tai Wang; Runsen Xu; Ruiyuan Lyu; Dahua Lin; Jiangmiao Pang

arXiv:2405.10370 · cs.CV · November 19, 2024 · 1 cites

TL;DR

This paper introduces Grounded 3D-LLM, a unified generative model for 3D scene understanding that uses referent tokens and large-scale datasets to perform diverse vision tasks with leading performance.

Contribution

It presents a novel 3D large multi-modal model that unifies various 3D vision tasks within a generative framework using referent tokens and contrastive pre-training.

Findings

01 Achieves state-of-the-art results on multiple 3D benchmarks.

02 Effectively handles both open-ended and close-ended 3D vision tasks.

03 Demonstrates broad applicability across diverse 3D scene understanding tasks.

Abstract

Prior studies on 3D scene understanding have primarily developed specialized models for specific tasks or required task-specific fine-tuning. In this study, we propose Grounded 3D-LLM, which explores the potential of 3D large multi-modal models (3D LMMs) to consolidate various 3D vision tasks within a unified generative framework. The model uses scene referent tokens as special noun phrases to reference 3D scenes, enabling it to handle sequences that interleave 3D and textual data. Per-task instruction-following templates are employed to ensure naturalness and diversity in translating 3D vision tasks into language formats. To facilitate the use of referent tokens in subsequent language modeling, we provide a large-scale, automatically curated grounded scene-text dataset with over 1 million phrase-to-region correspondences and introduce Contrastive Language-Scene Pre-training (CLASP) to…
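
To make the idea of referent tokens more concrete, here is a minimal sketch of how text that interleaves noun phrases with references to scene regions might be parsed into phrase-to-region correspondences. The marker syntax (`<p>…</p><ref:ID>`), the `GroundedPhrase` type, and the `parse_grounded_text` helper are all hypothetical and chosen only for illustration; the paper's actual token format and decoding may differ.

```python
import re
from dataclasses import dataclass

# Hypothetical marker syntax for illustration only:
#   <p>noun phrase</p><ref:REGION_ID>
# The token format used by Grounded 3D-LLM itself may differ.
REF_PATTERN = re.compile(r"<p>(.*?)</p><ref:(\d+)>")


@dataclass
class GroundedPhrase:
    phrase: str      # the noun phrase as it appears in the text
    region_id: int   # assumed integer id of a 3D region in the scene


def parse_grounded_text(text):
    """Return the plain text and the list of phrase-to-region links."""
    phrases = [
        GroundedPhrase(m.group(1), int(m.group(2)))
        for m in REF_PATTERN.finditer(text)
    ]
    # Strip the markers, keeping only the phrase itself.
    plain = REF_PATTERN.sub(lambda m: m.group(1), text)
    return plain, phrases


reply = "The <p>chair</p><ref:7> is next to the <p>wooden desk</p><ref:12>."
plain, phrases = parse_grounded_text(reply)
print(plain)
for p in phrases:
    print(p.phrase, "->", p.region_id)
```

In this sketch each referent marker ties one phrase to one region id, which is the kind of phrase-to-region correspondence the abstract describes; how the model resolves a referent token to actual 3D geometry is not covered here.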

Code & Models

Repositories

OpenRobotLab/Grounded_3D-LLM
PyTorch · Official

Datasets

ShuaiYang03/Grounded_3D_LLM_with_Referent_Tokens_Dataset
dataset · 4 dl

Taxonomy

Topics: Image Processing and 3D Reconstruction · Tunneling and Rock Mechanics · Natural Language Processing Techniques