UniT3D: A Unified Transformer for 3D Dense Captioning and Visual Grounding
Dave Zhenyu Chen, Ronghang Hu, Xinlei Chen, Matthias Nie{\ss}ner,, Angel X. Chang

TL;DR
UniT3D introduces a unified transformer model that jointly learns 3D dense captioning and visual grounding, leveraging shared multimodal representations and joint pre-training to improve performance across both tasks.
Contribution
The paper presents UniT3D, a fully unified transformer architecture that explicitly models the shared nature of 3D dense captioning and visual grounding, enabling joint learning and leveraging diverse data sources.
Findings
Significant performance improvements on 3D dense captioning.
Enhanced 3D visual grounding accuracy.
Effective use of synthesized 2D data for pre-training.
Abstract
Performing 3D dense captioning and visual grounding requires a common and shared understanding of the underlying multimodal relationships. However, despite some previous attempts on connecting these two related tasks with highly task-specific neural modules, it remains understudied how to explicitly depict their shared nature to learn them simultaneously. In this work, we propose UniT3D, a simple yet effective fully unified transformer-based architecture for jointly solving 3D visual grounding and dense captioning. UniT3D enables learning a strong multimodal representation across the two tasks through a supervised joint pre-training scheme with bidirectional and seq-to-seq objectives. With a generic architecture design, UniT3D allows expanding the pre-training scope to more various training sources such as the synthesized data from 2D prior knowledge to benefit 3D vision-language tasks.…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsMultimodal Machine Learning Applications · Human Pose and Action Recognition · Domain Adaptation and Few-Shot Learning
