3D Scene Graph Guided Vision-Language Pre-training
Hao Liu, Yanni Ma, Yan Liu, Haihong Xiao, Ying He

TL;DR
This paper introduces a unified 3D scene graph-guided vision-language pre-training framework that leverages scene graphs and contrastive learning to improve 3D reasoning tasks without task-specific modules.
Contribution
It proposes a general-purpose pre-training approach using scene graphs, contrastive learning, and masked modality learning for diverse 3D vision-language tasks.
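To make the contrastive-learning part concrete, here is a minimal sketch of a symmetric InfoNCE-style loss between paired 3D object embeddings and text embeddings. This is an illustration only, not the paper's formulation: the function names, the temperature value of 0.07, and the use of a symmetric cross-entropy over cosine similarities are all assumptions.

```python
import math

def l2_normalize(vec):
    # Scale a vector to unit length; an all-zero vector is returned unchanged.
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]

def cross_entropy_row(logits, target):
    # Numerically stable -log softmax(logits)[target].
    peak = max(logits)
    log_sum = peak + math.log(sum(math.exp(z - peak) for z in logits))
    return log_sum - logits[target]

def contrastive_loss(obj_feats, txt_feats, temperature=0.07):
    # Hypothetical helper: obj_feats[i] and txt_feats[i] form a positive pair,
    # every other combination in the batch is treated as a negative.
    objs = [l2_normalize(v) for v in obj_feats]
    txts = [l2_normalize(v) for v in txt_feats]
    n = len(objs)
    # Cosine similarity matrix, scaled by the (assumed) temperature.
    sim = [[sum(a * b for a, b in zip(o, t)) / temperature for t in txts]
           for o in objs]
    # Object-to-text direction: each row should peak on its diagonal entry.
    obj_to_txt = sum(cross_entropy_row(sim[i], i) for i in range(n)) / n
    # Text-to-object direction: the same check over columns.
    txt_to_obj = sum(
        cross_entropy_row([sim[j][i] for j in range(n)], i) for i in range(n)
    ) / n
    return 0.5 * (obj_to_txt + txt_to_obj)
```

With this sketch, correctly paired embeddings produce a loss near zero, while mismatched pairs produce a large loss, which is the signal that pulls matching object and text representations together.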
Findings
Achieves state-of-the-art results on 3D visual grounding.
Improves performance on 3D dense captioning.
Enhances 3D question answering accuracy.
Abstract
3D vision-language (VL) reasoning has gained significant attention due to its potential to bridge the 3D physical world with natural language descriptions. Existing approaches typically follow task-specific, highly specialized paradigms. Therefore, these methods focus on a limited range of reasoning sub-tasks and rely heavily on hand-crafted modules and auxiliary losses. This highlights the need for a simpler, unified, and general-purpose model. In this paper, we leverage the inherent connection between 3D scene graphs and natural language, proposing a 3D scene graph-guided vision-language pre-training (VLP) framework. Our approach utilizes modality encoders, graph convolutional layers, and cross-attention layers to learn universal representations that adapt to a variety of 3D VL reasoning tasks, thereby eliminating the need for task-specific designs. The pre-training objectives…
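The abstract names graph convolutional layers and cross-attention layers as building blocks. The sketch below shows one plausible way the two could compose: a mean-aggregation graph convolution over scene-graph nodes, followed by text tokens attending to the updated nodes. Everything specific here is an assumption made for illustration: the undirected, unlabeled edges, the row-normalized adjacency, the ReLU, the single attention head without learned projections, and all function names.

```python
import math

def matmul(a, b):
    # Plain matrix product of two list-of-lists matrices.
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]

def graph_conv(node_feats, edges, weight):
    # One graph convolution: average each node with its neighbours
    # (self-loop included), apply a linear map, then ReLU.
    n = len(node_feats)
    adj = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for i, j in edges:
        adj[i][j] = adj[j][i] = 1.0  # assumption: edges treated as undirected
    adj = [[a / sum(row) for a in row] for row in adj]  # row-normalise
    hidden = matmul(matmul(adj, node_feats), weight)
    return [[max(0.0, h) for h in row] for row in hidden]

def softmax(scores):
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attention(queries, keys, values):
    # Scaled dot-product attention: each query (text token) forms a
    # weighted average of the values (scene-graph nodes).
    dim = len(keys[0])
    fused = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim)
                  for k in keys]
        weights = softmax(scores)
        fused.append([sum(w * v[c] for w, v in zip(weights, values))
                      for c in range(len(values[0]))])
    return fused

# Toy scene graph: three objects, two relationships, identity weights.
nodes = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
edges = [(0, 1), (1, 2)]
identity = [[1.0, 0.0], [0.0, 1.0]]
node_out = graph_conv(nodes, edges, identity)

# Two toy text-token embeddings attend to the updated object nodes.
text_tokens = [[1.0, 0.0], [0.0, 1.0]]
fused = cross_attention(text_tokens, node_out, node_out)
```

In a real model the weight matrix and the attention projections would be learned, and scene-graph edges would typically carry direction and relationship labels; the sketch drops those details to keep the data flow visible.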
Taxonomy
Topics: Multimodal Machine Learning Applications
Methods: Softmax · Attention Is All You Need · ALIGN · Focus
