End-to-End 3D Dense Captioning with Vote2Cap-DETR
Sijin Chen, Hongyuan Zhu, Xin Chen, Yinjie Lei, Tao Chen, Gang Yu

TL;DR
This paper introduces Vote2Cap-DETR, a transformer-based framework for 3D dense captioning that simplifies the pipeline by integrating detection and captioning into a single stage, achieving state-of-the-art results.
Contribution
The paper presents a full transformer encoder-decoder architecture with learnable vote queries for 3D dense captioning, eliminating the need for hand-crafted components and two-stage detection-captioning pipelines.
Findings
Surpasses the previous state of the art by 11.13% CIDEr@0.5IoU on ScanRefer
Achieves a 7.11% CIDEr@0.5IoU improvement over the previous state of the art on Nr3D
Demonstrates effectiveness of a unified transformer framework for 3D dense captioning
Abstract
3D dense captioning aims to generate multiple captions localized with their associated object regions. Existing methods follow a sophisticated "detect-then-describe" pipeline equipped with numerous hand-crafted components. However, these hand-crafted components would yield suboptimal performance given cluttered object spatial and class distributions among different scenes. In this paper, we propose a simple-yet-effective transformer framework Vote2Cap-DETR based on recent popular DEtection TRansformer (DETR). Compared with prior arts, our framework has several appealing advantages: 1) Without resorting to numerous hand-crafted components, our method is based on a full transformer encoder-decoder architecture with a learnable vote query driven object decoder, and a caption decoder that produces the dense captions in a set-prediction manner. 2) In contrast to the…
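To make the "set-prediction manner" mentioned in the abstract more concrete, the following is a minimal sketch of one-to-one matching between query predictions and ground-truth objects, the general idea behind DETR-style training. Everything here is an illustrative assumption rather than the paper's implementation: the function name `match_predictions` is hypothetical, the cost is a plain L1 distance between box centers, and the search is brute force (DETR-style systems typically use the Hungarian algorithm with a combined classification and box cost).

```python
from itertools import permutations


def match_predictions(pred_centers, gt_centers):
    """Brute-force one-to-one matching of ground-truth objects to predictions.

    Hypothetical helper for illustration only. Returns (pairs, cost), where
    pairs is a sorted list of (pred_index, gt_index) tuples that minimizes the
    total L1 distance between centers. Assumes there are at least as many
    predictions as ground-truth objects.
    """
    def l1(a, b):
        return sum(abs(x - y) for x, y in zip(a, b))

    best_cost, best_pairs = float("inf"), []
    # Try every ordered choice of distinct predictions, one per ground-truth object.
    for pred_ids in permutations(range(len(pred_centers)), len(gt_centers)):
        cost = sum(l1(pred_centers[p], gt_centers[g]) for g, p in enumerate(pred_ids))
        if cost < best_cost:
            best_cost = cost
            best_pairs = [(p, g) for g, p in enumerate(pred_ids)]
    return sorted(best_pairs), best_cost


# Three query predictions, two annotated objects (made-up coordinates).
preds = [(0.0, 0.0, 0.0), (2.1, 1.0, 0.5), (5.0, 5.0, 1.0)]
gts = [(2.0, 1.0, 0.5), (0.1, 0.0, 0.0)]
pairs, cost = match_predictions(preds, gts)

# Predictions left unmatched would be supervised as "no object".
unmatched = [i for i in range(len(preds)) if i not in {p for p, _ in pairs}]
```

In this toy setup each matched query would receive the box and caption supervision of its assigned object, while the unmatched query is treated as background. This is why a set-prediction model can avoid hand-crafted duplicate removal such as non-maximum suppression.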
Taxonomy
Topics: Multimodal Machine Learning Applications · Video Analysis and Summarization · Human Pose and Action Recognition
