3D Vision and Language Pretraining with Large-Scale Synthetic Data
Dejie Yang, Zhu Xu, Wentao Mo, Qingchao Chen, Siyuan Huang, and Yang Liu

TL;DR
This paper introduces SynVL3D, a large-scale synthetic 3D scene-text dataset, and a unified Transformer model for 3D vision-language pretraining, achieving state-of-the-art results on visual grounding, dense captioning, and question answering.
Contribution
The paper creates a comprehensive synthetic dataset for 3D vision-language pretraining and proposes a domain adaptation method to improve real-world task performance.
Findings
Achieves state-of-the-art results on visual grounding, dense captioning, and question answering.
Demonstrates the effectiveness of synthetic data and domain adaptation in 3D vision-language tasks.
Provides a scalable approach to enhance 3D-VLP with low-cost synthetic data.
Abstract
3D Vision-Language Pre-training (3D-VLP) aims to provide a pre-trained model that bridges 3D scenes with natural language, an important technique for embodied intelligence. However, current 3D-VLP datasets are hindered by limited scene-level diversity and insufficient fine-grained annotations (only 1.2K scenes and 280K textual annotations in ScanScribe), primarily because collecting and annotating 3D scenes is labor-intensive. To overcome these obstacles, we construct SynVL3D, a comprehensive synthetic scene-text corpus with 10K indoor scenes and 1M descriptions at the object, view, and room levels, which offers diverse scene data, rich textual descriptions, multi-grained 3D-text associations, and low collection cost. Utilizing the rich annotations in SynVL3D, we pre-train a simple and unified Transformer for aligning 3D and language with multi-grained…
Taxonomy
Topics: Robotics and Automated Systems · Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques
Methods: Attention Is All You Need · Linear Layer · Multi-Head Attention · Softmax · Byte Pair Encoding · Layer Normalization · Label Smoothing · Absolute Position Encodings · Position-Wise Feed-Forward Layer · Adam
