Vote2Cap-DETR++: Decoupling Localization and Describing for End-to-End 3D Dense Captioning

Sijin Chen, Hongyuan Zhu, Mingsheng Li, Xin Chen, Peng Guo, Yinjie Lei, Gang Yu, Taihao Li, and Tao Chen

arXiv:2309.02999 · cs.CV · September 7, 2023

TL;DR

This paper introduces Vote2Cap-DETR++, a transformer-based framework for 3D dense captioning that decouples object localization from caption generation, improving accuracy and efficiency over traditional detect-then-describe methods.

Contribution

The paper proposes Vote2Cap-DETR++, which uses task-specific queries and iterative spatial refinement to reduce the error accumulation of cascaded detect-then-describe pipelines and to improve both localization and description accuracy.

Findings

01. Outperforms traditional methods on the ScanRefer and Nr3D datasets

02. Decoupling localization and captioning improves task-specific feature learning

03. Iterative refinement accelerates convergence and enhances localization accuracy

Abstract

3D dense captioning requires a model to translate its understanding of an input 3D scene into several captions associated with different object regions. Existing methods adopt a sophisticated "detect-then-describe" pipeline, which builds explicit relation modules upon a 3D detector with numerous hand-crafted components. While these methods have achieved initial success, the cascade pipeline tends to accumulate errors because of duplicated and inaccurate box estimations and messy 3D scenes. In this paper, we first propose Vote2Cap-DETR, a simple-yet-effective transformer framework that decouples the decoding process of caption generation and object localization through parallel decoding. Moreover, we argue that object localization and description generation require different levels of scene understanding, which could be challenging for a shared set of queries to capture. To this end, we…
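The parallel decoding idea in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the layer sizes, the random weights, the names `box_head`, `caption_head`, and `decode_parallel`, and the simplified one-shot caption head are all assumptions made for illustration. The only point it shows is that both heads read the same decoded query features side by side, so the caption head never consumes the box estimates, in contrast to a detect-then-describe cascade.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes, chosen only for illustration.
NUM_QUERIES, DIM, VOCAB, MAX_LEN = 8, 16, 50, 5


def linear(in_dim, out_dim):
    """Return a random (weight, bias) pair standing in for a trained layer."""
    return rng.standard_normal((in_dim, out_dim)) * 0.1, np.zeros(out_dim)


def forward(layer, x):
    """Apply a linear layer to the last axis of x."""
    weight, bias = layer
    return x @ weight + bias


box_head = linear(DIM, 6)           # assumed box encoding: center (3) + size (3)
caption_head = linear(DIM, VOCAB)   # toy head: token logits per caption position
step_embed = rng.standard_normal((MAX_LEN, DIM)) * 0.1  # toy position embedding


def decode_parallel(query_feats):
    """Run the localization and caption heads in parallel on the same features.

    Neither head takes the other's output as input.
    """
    boxes = forward(box_head, query_feats)                      # (Q, 6)
    steps = query_feats[:, None, :] + step_embed[None, :, :]    # (Q, L, D)
    logits = forward(caption_head, steps)                       # (Q, L, V)
    tokens = logits.argmax(axis=-1)                             # (Q, L) greedy pick
    return boxes, tokens


# Stand-in for the decoder output: one feature vector per object query.
query_feats = rng.standard_normal((NUM_QUERIES, DIM))
boxes, tokens = decode_parallel(query_feats)
print(boxes.shape, tokens.shape)
```

In this sketch each query yields one box and one token sequence, so a noisy box does not propagate into the caption input. A practical caption head would be a learned language decoder rather than the single linear layer used here.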

Code & Models

Repositories

ch3cook-fdu/vote2cap-detr (PyTorch, official)

Taxonomy

Topics: Multimodal Machine Learning Applications · Human Pose and Action Recognition · Advanced Image and Video Retrieval Techniques