SceneVerse: Scaling 3D Vision-Language Learning for Grounded Scene Understanding

Baoxiong Jia; Yixin Chen; Huangyue Yu; Yan Wang; Xuesong Niu; Tengyu Liu; Qing Li; Siyuan Huang

arXiv:2401.09340 · cs.CV · September 25, 2024 · 2 cites

Open Access

TL;DR

SceneVerse introduces a million-scale 3D vision-language dataset and a unified pre-training framework for grounded scene understanding, reporting state-of-the-art results on 3D visual grounding benchmarks.

Contribution

The paper presents the first million-scale 3D vision-language dataset and a novel pre-training framework, addressing key challenges in 3D grounded scene understanding.

Findings

01 Achieved state-of-the-art results on 3D visual grounding benchmarks.

02 Demonstrated effective zero-shot transfer in 3D vision-language tasks.

03 Showcased the scalability and effectiveness of the SceneVerse dataset and the Grounded Pre-training for Scenes (GPS) framework.

Abstract

3D vision-language grounding, which focuses on aligning language with the 3D physical environment, stands as a cornerstone in the development of embodied agents. In comparison to recent advancements in the 2D domain, grounding language in 3D scenes faces several significant challenges: (i) the inherent complexity of 3D scenes due to the diverse object configurations, their rich attributes, and intricate relationships; (ii) the scarcity of paired 3D vision-language data to support grounded learning; and (iii) the absence of a unified learning framework to distill knowledge from grounded 3D data. In this work, we aim to address these three major challenges in 3D vision-language by examining the potential of systematically upscaling 3D vision-language learning in indoor environments. We introduce the first million-scale 3D vision-language dataset, SceneVerse, encompassing about 68K 3D…

Taxonomy

Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · 3D Surveying and Cultural Heritage

Methods: Grounded Pre-training for Scenes (GPS)