UniT3D: A Unified Transformer for 3D Dense Captioning and Visual Grounding

Dave Zhenyu Chen; Ronghang Hu; Xinlei Chen; Matthias Nießner; Angel X. Chang

arXiv:2212.00836·cs.CV·December 5, 2022·1 cites

Open Access

TL;DR

UniT3D introduces a unified transformer model that jointly learns 3D dense captioning and visual grounding, leveraging shared multimodal representations and joint pre-training to improve performance across both tasks.

Contribution

The paper presents UniT3D, a fully unified transformer architecture that explicitly models the shared nature of 3D dense captioning and visual grounding, enabling joint learning and leveraging diverse data sources.

Findings

01. Significant performance improvements on 3D dense captioning.

02. Enhanced 3D visual grounding accuracy.

03. Effective use of synthesized 2D data for pre-training.

Abstract

Performing 3D dense captioning and visual grounding requires a shared understanding of the underlying multimodal relationships. However, despite previous attempts at connecting these two related tasks with highly task-specific neural modules, how to explicitly model their shared nature and learn them simultaneously remains understudied. In this work, we propose UniT3D, a simple yet effective fully unified transformer-based architecture for jointly solving 3D visual grounding and dense captioning. UniT3D learns a strong multimodal representation across the two tasks through a supervised joint pre-training scheme with bidirectional and seq-to-seq objectives. With a generic architecture design, UniT3D allows the pre-training scope to be expanded to a wider variety of training sources, such as data synthesized from 2D prior knowledge, to benefit 3D vision-language tasks.…
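The abstract does not say how the bidirectional and seq-to-seq objectives are implemented. One common way to train a single transformer under both objectives is to switch the self-attention mask. The sketch below illustrates that general idea only. It is an assumption, not the authors' implementation, and the function names and the layout (object tokens first, then word tokens) are hypothetical.

```python
from typing import List

# mask[q][k] is True when query position q may attend to key position k.
Mask = List[List[bool]]


def bidirectional_mask(num_obj: int, num_txt: int) -> Mask:
    """Every token (object or word) may attend to every other token.

    Assumed here to suit grounding-style objectives, where the full
    description is available as input.
    """
    n = num_obj + num_txt
    return [[True] * n for _ in range(n)]


def seq2seq_mask(num_obj: int, num_txt: int) -> Mask:
    """Object tokens see only objects; word tokens see all objects
    plus themselves and earlier words (causal over the text).

    Assumed here to suit captioning-style objectives, where words are
    generated left to right.
    """
    n = num_obj + num_txt
    mask = [[False] * n for _ in range(n)]
    for q in range(n):
        for k in range(n):
            if k < num_obj:
                # All queries may attend to every object token.
                mask[q][k] = True
            elif q >= num_obj and k <= q:
                # Word queries attend to themselves and earlier words.
                mask[q][k] = True
    return mask


if __name__ == "__main__":
    # Two hypothetical object tokens followed by three word tokens.
    for row in seq2seq_mask(2, 3):
        print("".join("1" if allowed else "0" for allowed in row))
```

Under this assumption the same model weights serve both tasks, and only the mask passed to the attention layers changes between the two objectives.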

Taxonomy

Topics: Multimodal Machine Learning Applications · Human Pose and Action Recognition · Domain Adaptation and Few-Shot Learning