End-to-End 3D Dense Captioning with Vote2Cap-DETR

Sijin Chen; Hongyuan Zhu; Xin Chen; Yinjie Lei; Tao Chen; Gang YU

arXiv:2301.02508 · cs.CV · January 9, 2023

TL;DR

This paper introduces Vote2Cap-DETR, a transformer-based framework for 3D dense captioning that simplifies the pipeline by integrating detection and captioning into a single stage, achieving state-of-the-art results.

Contribution

The paper presents a full transformer encoder-decoder architecture with learnable vote queries for 3D dense captioning, eliminating the need for hand-crafted components and two-stage detection-captioning pipelines.
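The summary above does not spell out how a vote query is built. As a minimal sketch, assuming a VoteNet-style construction in which each query position is a seed point shifted by a predicted offset and its feature is pooled from nearby scene points, the idea could look like the following. The names `shift_seeds`, `pool_features`, and `make_vote_queries`, the mean pooling, and the `radius` parameter are all hypothetical; in a real model the offsets would come from a learned network rather than being passed in.

```python
import math
from typing import List, Sequence, Tuple

Point = Tuple[float, ...]


def shift_seeds(seeds: Sequence[Point], offsets: Sequence[Point]) -> List[Point]:
    """Move each seed point by its (assumed, externally predicted) offset."""
    return [tuple(s + d for s, d in zip(seed, off)) for seed, off in zip(seeds, offsets)]


def pool_features(center: Point, points: Sequence[Point],
                  feats: Sequence[Sequence[float]], radius: float) -> List[float]:
    """Average the features of all scene points within `radius` of `center`.

    Mean pooling is a simplification chosen for this sketch only.
    """
    dim = len(feats[0]) if feats else 0
    nearby = [f for p, f in zip(points, feats) if math.dist(center, p) <= radius]
    if not nearby:
        return [0.0] * dim  # no neighbours: fall back to a zero feature
    return [sum(col) / len(nearby) for col in zip(*nearby)]


def make_vote_queries(seeds, offsets, points, feats, radius=0.3):
    """Return one (position, feature) pair per seed; hypothetical helper."""
    positions = shift_seeds(seeds, offsets)
    return [(pos, pool_features(pos, points, feats, radius)) for pos in positions]


if __name__ == "__main__":
    queries = make_vote_queries(
        seeds=[(0.0, 0.0, 0.0)],
        offsets=[(1.0, 0.0, 0.0)],
        points=[(1.0, 0.0, 0.0), (1.1, 0.0, 0.0), (5.0, 5.0, 5.0)],
        feats=[[1.0, 2.0], [3.0, 4.0], [9.0, 9.0]],
    )
    print(queries)
```

Under these assumptions, the point of the shift is that a query starts closer to an object centre than a raw sampled seed would, which is the intuition usually given for vote-based proposals.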

Findings

01 Surpasses the previous state of the art by 11.13% CIDEr@0.5IoU on ScanRefer

02 Improves CIDEr@0.5IoU by 7.11% over the previous state of the art on the Nr3D dataset

03 Demonstrates the effectiveness of a unified transformer framework for 3D dense captioning
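3D dense-captioning benchmarks such as ScanRefer commonly report caption metrics gated by box overlap, written m@kIoU (for example, CIDEr at an IoU threshold of 0.5): a caption's score counts only if its predicted box overlaps the ground-truth box by at least k. The sketch below illustrates that gating, assuming axis-aligned boxes, predictions already paired one-to-one with ground-truth objects, and per-caption scores computed elsewhere. The function names and box layout are illustrative and are not taken from the paper's code.

```python
from typing import Sequence, Tuple

# Axis-aligned box as (xmin, ymin, zmin, xmax, ymax, zmax); layout is an assumption.
Box = Tuple[float, float, float, float, float, float]


def box_volume(b: Box) -> float:
    return (b[3] - b[0]) * (b[4] - b[1]) * (b[5] - b[2])


def box_iou_3d(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned 3D boxes."""
    inter = 1.0
    for axis in range(3):
        lo = max(a[axis], b[axis])
        hi = min(a[axis + 3], b[axis + 3])
        if hi <= lo:
            return 0.0  # no overlap along this axis
        inter *= hi - lo
    return inter / (box_volume(a) + box_volume(b) - inter)


def metric_at_k_iou(pred_boxes: Sequence[Box], gt_boxes: Sequence[Box],
                    caption_scores: Sequence[float], k: float = 0.5) -> float:
    """Average caption score over ground-truth objects, gated by IoU >= k.

    pred_boxes[i] is assumed to be the prediction paired with gt_boxes[i], and
    caption_scores[i] the caption metric (e.g. CIDEr) for that pair.
    """
    if not gt_boxes:
        return 0.0
    total = 0.0
    for pred, gt, score in zip(pred_boxes, gt_boxes, caption_scores):
        if box_iou_3d(pred, gt) >= k:
            total += score
    return total / len(gt_boxes)


if __name__ == "__main__":
    unit = (0.0, 0.0, 0.0, 1.0, 1.0, 1.0)
    far = (5.0, 5.0, 5.0, 6.0, 6.0, 6.0)
    print(metric_at_k_iou([unit, far], [unit, unit], [0.8, 0.6]))
```

Because poorly localized captions contribute zero, a model must both detect and describe an object well to score, which is why the metric suits a joint detection-and-captioning task.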

Abstract

3D dense captioning aims to generate multiple captions localized with their associated object regions. Existing methods follow a sophisticated "detect-then-describe" pipeline equipped with numerous hand-crafted components. However, these hand-crafted components yield suboptimal performance given the cluttered spatial and class distributions of objects across different scenes. In this paper, we propose a simple yet effective transformer framework, Vote2Cap-DETR, based on the recently popular DEtection TRansformer (DETR). Compared with prior art, our framework has several appealing advantages: 1) Without resorting to numerous hand-crafted components, our method is based on a full transformer encoder-decoder architecture with a learnable vote query driven object decoder, and a caption decoder that produces the dense captions in a set-prediction manner. 2) In contrast to the…

Code & Models

Repositories

ch3cook-fdu/vote2cap-detr
PyTorch · Official

Taxonomy

Topics: Multimodal Machine Learning Applications · Video Analysis and Summarization · Human Pose and Action Recognition