3D-VisTA: Pre-trained Transformer for 3D Vision and Text Alignment

Ziyu Zhu; Xiaojian Ma; Yixin Chen; Zhidong Deng; Siyuan Huang; Qing Li

arXiv:2308.04352 · cs.CV · August 9, 2023 · 1 citation

TL;DR

3D-VisTA introduces a simple, unified Transformer model for 3D vision-language tasks, leveraging a new large-scale dataset and achieving state-of-the-art results with high data efficiency.

Contribution

The paper presents 3D-VisTA, a pre-trained Transformer that simplifies 3D-VL modeling without complex modules, and introduces ScanScribe, a large-scale 3D scene-text dataset for pre-training.

Findings

01

Achieves state-of-the-art performance on 3D-VL tasks.

02

Demonstrates strong results with limited annotated data.

03

Simplifies 3D-VL modeling with self-attention layers.
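
The idea of using plain self-attention both within each modality and for fusing them can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes a single attention head with identity query/key/value projections and no learned weights, and the names `self_attention` and `fuse` are hypothetical. It only shows the pattern of attending within text tokens, within 3D object tokens, and then over their concatenation.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(tokens):
    """Single-head scaled dot-product self-attention.

    Assumption: identity projections (Q = K = V = tokens), so there are
    no learned parameters. Each output token is a weighted average of
    all input tokens.
    """
    d = len(tokens[0])
    out = []
    for q in tokens:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in tokens]
        weights = softmax(scores)
        out.append([sum(w * v[i] for w, v in zip(weights, tokens)) for i in range(d)])
    return out


def fuse(text_tokens, object_tokens):
    """Hypothetical helper: single-modal attention, then joint attention."""
    # Single-modal modeling: each modality attends only to itself.
    text = self_attention(text_tokens)
    objs = self_attention(object_tokens)
    # Multi-modal fusion: concatenate both sequences and attend jointly.
    joint = self_attention(text + objs)
    return joint[:len(text)], joint[len(text):]


if __name__ == "__main__":
    text = [[1.0, 0.0], [0.0, 1.0]]                # two toy text tokens
    objs = [[1.0, 1.0], [0.5, -0.5], [0.0, 2.0]]   # three toy object tokens
    fused_text, fused_objs = fuse(text, objs)
    print(len(fused_text), len(fused_objs))
```

A real model would add learned projections, multiple heads, feed-forward layers, and positional or spatial embeddings; the sketch omits all of these to keep the attention pattern visible.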

Abstract

3D vision-language grounding (3D-VL) is an emerging field that aims to connect the 3D physical world with natural language, which is crucial for achieving embodied intelligence. Current 3D-VL models rely heavily on sophisticated modules, auxiliary losses, and optimization tricks, which calls for a simple and unified model. In this paper, we propose 3D-VisTA, a pre-trained Transformer for 3D Vision and Text Alignment that can be easily adapted to various downstream tasks. 3D-VisTA simply utilizes self-attention layers for both single-modal modeling and multi-modal fusion without any sophisticated task-specific design. To further enhance its performance on 3D-VL tasks, we construct ScanScribe, the first large-scale 3D scene-text pairs dataset for 3D-VL pre-training. ScanScribe contains 2,995 RGB-D scans for 1,185 unique indoor scenes originating from ScanNet and 3R-Scan datasets, along…

Code & Models

Repositories

3d-vista/3D-VisTA (PyTorch)

Taxonomy

Topics: Multimodal Machine Learning Applications · Human Pose and Action Recognition · Domain Adaptation and Few-Shot Learning

Methods: Multi-Head Attention · Attention Is All You Need · Attention Dropout · Cosine Annealing · Weight Decay · Label Smoothing · Linear Layer