Vote2Cap-DETR++: Decoupling Localization and Describing for End-to-End 3D Dense Captioning
Sijin Chen, Hongyuan Zhu, Mingsheng Li, Xin Chen, Peng Guo, Yinjie Lei, Gang Yu, Taihao Li, and Tao Chen

TL;DR
This paper introduces Vote2Cap-DETR++, a transformer-based framework that decouples localization from captioning in 3D dense captioning, improving accuracy and efficiency over traditional detect-then-describe methods.
Contribution
The paper proposes Vote2Cap-DETR++, which uses task-specific queries and iterative spatial refinement to reduce the errors that accumulate in cascaded detect-then-describe pipelines and to improve both localization and description accuracy.
Findings
Outperforms traditional methods on ScanRefer and Nr3D datasets
Decoupling localization and captioning improves task-specific feature learning
Iterative refinement accelerates convergence and enhances localization accuracy
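The iterative refinement idea in the findings above can be illustrated with a toy sketch in Python. This is not the paper's implementation: `refine_queries`, `toy_offset`, and `CENTERS` are hypothetical names, and the offset rule (step each query halfway toward its nearest object center) is an assumed stand-in for a learned prediction head. It only shows the general pattern of updating query positions layer by layer.

```python
from typing import Callable, List, Tuple

Point = Tuple[float, float, float]


def sq_dist(a: Point, b: Point) -> float:
    """Squared Euclidean distance between two 3D points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def refine_queries(
    queries: List[Point],
    predict_offset: Callable[[Point], Point],
    num_layers: int,
) -> List[List[Point]]:
    """Shift every query by a predicted offset once per layer.

    Returns the positions after each layer; index 0 is the initial state.
    """
    history = [list(queries)]
    for _ in range(num_layers):
        updated = []
        for q in history[-1]:
            dx, dy, dz = predict_offset(q)
            updated.append((q[0] + dx, q[1] + dy, q[2] + dz))
        history.append(updated)
    return history


# Assumption: two made-up object centers standing in for a real 3D scene.
CENTERS: List[Point] = [(0.0, 0.0, 0.0), (4.0, 0.0, 0.0)]


def toy_offset(q: Point) -> Point:
    """Hypothetical offset head: move halfway toward the nearest center."""
    target = min(CENTERS, key=lambda c: sq_dist(q, c))
    return (
        0.5 * (target[0] - q[0]),
        0.5 * (target[1] - q[1]),
        0.5 * (target[2] - q[2]),
    )


history = refine_queries(
    [(1.0, 1.0, 0.0), (3.0, -1.0, 0.0)], toy_offset, num_layers=3
)

# Worst-case distance from any query to its nearest center, per layer.
errors = [
    max(min(sq_dist(q, c) for c in CENTERS) ** 0.5 for q in layer)
    for layer in history
]
```

In this toy setup the worst-case distance halves at every layer, because the stand-in offset rule always covers half of the remaining gap. A learned offset head would not behave this regularly; the sketch only conveys why repeated small corrections can move queries closer to objects than a single prediction.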
Abstract
3D dense captioning requires a model to translate its understanding of an input 3D scene into several captions associated with different object regions. Existing methods adopt a sophisticated "detect-then-describe" pipeline, which builds explicit relation modules upon a 3D detector with numerous hand-crafted components. While these methods have achieved initial success, the cascade pipeline tends to accumulate errors because of duplicated and inaccurate box estimations and messy 3D scenes. In this paper, we first propose Vote2Cap-DETR, a simple-yet-effective transformer framework that decouples the decoding process of caption generation and object localization through parallel decoding. Moreover, we argue that object localization and description generation require different levels of scene understanding, which could be challenging for a shared set of queries to capture. To this end, we…
Taxonomy
Topics: Multimodal Machine Learning Applications · Human Pose and Action Recognition · Advanced Image and Video Retrieval Techniques
