3D Vision and Language Pretraining with Large-Scale Synthetic Data

Dejie Yang, Zhu Xu, Wentao Mo, Qingchao Chen, Siyuan Huang, and Yang Liu

arXiv:2407.06084 · cs.CV · July 9, 2024

TL;DR

This paper introduces SynVL3D, a large-scale synthetic 3D scene-text dataset, and a unified Transformer model for 3D vision-language pretraining, achieving state-of-the-art results on downstream tasks including visual grounding, dense captioning, and question answering.

Contribution

The paper creates a comprehensive synthetic dataset for 3D vision-language pretraining and proposes a domain adaptation method to improve real-world task performance.

Findings

01. Achieves state-of-the-art results on visual grounding, dense captioning, and question answering.

02. Demonstrates the effectiveness of synthetic data and domain adaptation in 3D vision-language tasks.

03. Provides a scalable approach to enhance 3D-VLP with low-cost synthetic data.

Abstract

3D Vision-Language Pre-training (3D-VLP) aims to provide a pre-trained model that bridges 3D scenes with natural language, an important capability for embodied intelligence. However, current 3D-VLP datasets are hindered by limited scene-level diversity and insufficient fine-grained annotations (only 1.2K scenes and 280K textual annotations in ScanScribe), primarily because collecting and annotating 3D scenes is labor-intensive. To overcome these obstacles, we construct SynVL3D, a comprehensive synthetic scene-text corpus with 10K indoor scenes and 1M descriptions at the object, view, and room levels, which offers diverse scene data, rich textual descriptions, multi-grained 3D-text associations, and low collection cost. Utilizing the rich annotations in SynVL3D, we pre-train a simple and unified Transformer for aligning 3D and language with multi-grained…
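
To make the idea of descriptions at object, view, and room levels concrete, here is a minimal Python sketch of how such multi-grained scene-text records might be represented and tallied. It is an illustration only, not the SynVL3D file format or the authors' code: the `Description` class, its field names, the `count_by_level` helper, and the example texts are all hypothetical.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Dict, Iterable

# The three granularities named in the abstract.
LEVELS = ("object", "view", "room")


@dataclass(frozen=True)
class Description:
    """One text annotation tied to a scene (hypothetical schema)."""
    scene_id: str   # assumed scene identifier
    level: str      # one of LEVELS
    target: str     # assumed id of the described object, view, or room
    text: str

    def __post_init__(self) -> None:
        # Reject granularities other than the three the abstract lists.
        if self.level not in LEVELS:
            raise ValueError(f"unknown level: {self.level!r}")


def count_by_level(descriptions: Iterable[Description]) -> Dict[str, int]:
    """Count annotations per granularity, reporting zero for empty levels."""
    counts = Counter(d.level for d in descriptions)
    return {level: counts.get(level, 0) for level in LEVELS}


# Made-up records for illustration; none of this text comes from the dataset.
examples = [
    Description("scene_0001", "object", "chair_3", "a wooden chair next to the desk"),
    Description("scene_0001", "object", "lamp_1", "a floor lamp in the corner"),
    Description("scene_0001", "view", "view_0", "the desk is left of the window"),
    Description("scene_0001", "room", "room_0", "a small study with one desk"),
]

print(count_by_level(examples))
```

Keeping the level as an explicit field lets a single record type cover all three granularities, so a loader can filter or balance by level without separate schemas.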

Code & Models

Repositories

idejie/3DSyn (PyTorch, official)

Taxonomy

Topics: Robotics and Automated Systems · Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques

Methods: Attention Is All You Need · Linear Layer · Multi-Head Attention · Softmax · Byte Pair Encoding · Layer Normalization · Label Smoothing · Absolute Position Encodings · Position-Wise Feed-Forward Layer · Adam