Bi-directional Contextual Attention for 3D Dense Captioning

Minjung Kim; Hyung Suk Lim; Soonyoung Lee; Bumsoo Kim; Gunhee Kim

arXiv:2408.06662·cs.CV·August 14, 2024

TL;DR

This paper introduces BiCA, a transformer-based approach with bi-directional attention for 3D dense captioning, effectively balancing local object localization and global contextual understanding to improve caption quality.

Contribution

The paper proposes a novel BiCA framework that uses bi-directional contextual attention to enhance 3D dense captioning by integrating local and global scene information.

Findings

1. Significant improvement over prior methods on benchmark datasets.
2. Enhanced localization accuracy and caption quality.
3. Effective balancing of local and global contextual information.

Abstract

3D dense captioning is a task involving the localization of objects and the generation of descriptions for each object in a 3D scene. Recent approaches have attempted to incorporate contextual information by modeling relationships between object pairs or aggregating the nearest-neighbor features of an object. However, the contextual information constructed in these scenarios is limited in two respects. First, objects have multiple positional relationships that exist across the entire global scene, not only near the object itself. Second, it faces contradictory objectives: localization and attribute descriptions are generated better with tight localization, while descriptions involving global positional relations are generated better with contextualized features of the global scene. To overcome this challenge, we introduce BiCA, a transformer encoder-decoder pipeline that engages…
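
The abstract describes attention that runs in two directions between object-level (local) and scene-level (global) information. As a rough illustration only, and not the authors' implementation, the sketch below applies plain scaled dot-product attention in both directions between two small feature sets. The names `object_feats`, `context_feats`, `cross_attend`, and `bidirectional_attention` are hypothetical, and the learned projections, multiple heads, and decoder layers of a real transformer are omitted.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attend(queries, keys_values):
    """Scaled dot-product attention without learned projections.

    Each query is replaced by a weighted average of `keys_values`,
    which serve as both keys and values in this simplified sketch.
    """
    dim = len(queries[0])
    attended = []
    for q in queries:
        scores = [
            sum(a * b for a, b in zip(q, k)) / math.sqrt(dim)
            for k in keys_values
        ]
        weights = softmax(scores)
        attended.append(
            [sum(w * v[d] for w, v in zip(weights, keys_values)) for d in range(dim)]
        )
    return attended


def bidirectional_attention(object_feats, context_feats):
    """Run attention in both directions (hypothetical helper).

    Direction 1: each object gathers information from the scene context.
    Direction 2: each context element gathers information from the objects.
    """
    context_for_objects = cross_attend(object_feats, context_feats)
    objects_for_context = cross_attend(context_feats, object_feats)
    return context_for_objects, objects_for_context


# Toy 2-D features: two "objects" and three "context" elements (made-up values).
object_feats = [[1.0, 0.0], [0.0, 1.0]]
context_feats = [[0.5, 0.5], [1.0, -1.0], [-1.0, 1.0]]

ctx_for_objects, obj_for_context = bidirectional_attention(object_feats, context_feats)
print(ctx_for_objects)
print(obj_for_context)
```

Each output row is a convex combination of the attended features, so the first result has one row per object and the second has one row per context element. How BiCA actually combines the two directions is not specified in the truncated abstract above.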

Taxonomy

Topics: Advanced Vision and Imaging · Multimodal Machine Learning Applications · Human Pose and Action Recognition

Methods: Softmax · Attention Is All You Need