3D Scene Graph Guided Vision-Language Pre-training

Hao Liu; Yanni Ma; Yan Liu; Haihong Xiao; Ying He

arXiv:2411.18666 · cs.CV · December 2, 2024

Open Access

TL;DR

This paper introduces a unified 3D scene graph-guided vision-language pre-training framework that leverages scene graphs and contrastive learning to improve 3D reasoning tasks without task-specific modules.

Contribution

It proposes a general-purpose pre-training approach using scene graphs, contrastive learning, and masked modality learning for diverse 3D vision-language tasks.

Findings

01. Achieves state-of-the-art results on 3D visual grounding.

02. Improves performance on 3D dense captioning.

03. Enhances 3D question answering accuracy.

Abstract

3D vision-language (VL) reasoning has gained significant attention due to its potential to bridge the 3D physical world with natural language descriptions. Existing approaches typically follow task-specific, highly specialized paradigms. As a result, these methods cover only a limited range of reasoning sub-tasks and rely heavily on hand-crafted modules and auxiliary losses. This highlights the need for a simpler, unified, and general-purpose model. In this paper, we leverage the inherent connection between 3D scene graphs and natural language, proposing a 3D scene graph-guided vision-language pre-training (VLP) framework. Our approach utilizes modality encoders, graph convolutional layers, and cross-attention layers to learn universal representations that adapt to a variety of 3D VL reasoning tasks, thereby eliminating the need for task-specific designs. The pre-training objectives…
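The summary above names contrastive learning as one ingredient of the pre-training, but the truncated abstract does not state the loss itself. As an illustration only, the sketch below shows a generic symmetric InfoNCE-style contrastive loss between matched 3D object features and text features. It is not the paper's objective: the function names, the temperature value of 0.07, and the toy features are all assumptions made for this example.

```python
import math
import random


def l2_normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def symmetric_contrastive_loss(obj_feats, txt_feats, temperature=0.07):
    """Generic InfoNCE-style loss; row i of each input is assumed to be a matched pair.

    Hypothetical helper for illustration, not the paper's formulation.
    """
    obj = [l2_normalize(v) for v in obj_feats]
    txt = [l2_normalize(v) for v in txt_feats]
    # Pairwise cosine similarities, sharpened by the temperature.
    logits = [[sum(a * b for a, b in zip(o, t)) / temperature for t in txt]
              for o in obj]

    def diag_cross_entropy(rows):
        # Cross-entropy where the correct class for row i is column i.
        total = 0.0
        for i, row in enumerate(rows):
            m = max(row)  # subtract the max for numerical stability
            log_z = m + math.log(sum(math.exp(x - m) for x in row))
            total += log_z - row[i]
        return total / len(rows)

    transposed = [list(col) for col in zip(*logits)]
    # Average the object-to-text and text-to-object directions.
    return 0.5 * (diag_cross_entropy(logits) + diag_cross_entropy(transposed))


# Toy data: object features are noisy copies of the text features.
random.seed(0)
txt = [[random.gauss(0, 1) for _ in range(8)] for _ in range(4)]
obj = [[x + 0.01 * random.gauss(0, 1) for x in row] for row in txt]

aligned = symmetric_contrastive_loss(obj, txt)
shuffled = symmetric_contrastive_loss(obj[::-1], txt)  # every pair mismatched
```

With correctly paired rows the loss is small, and reversing the row order of one modality so that no pair matches makes it much larger. That gap is the signal a contrastive objective uses to pull matched 3D and text representations together.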

Taxonomy

Topics: Multimodal Machine Learning Applications

Methods: Softmax · Attention Is All You Need · ALIGN · Focus