Bi-directional Contextual Attention for 3D Dense Captioning
Minjung Kim, Hyung Suk Lim, Soonyoung Lee, Bumsoo Kim, Gunhee Kim

TL;DR
This paper introduces BiCA, a transformer-based approach with bi-directional attention for 3D dense captioning, effectively balancing local object localization and global contextual understanding to improve caption quality.
Contribution
The paper proposes a novel BiCA framework that uses bi-directional contextual attention to enhance 3D dense captioning by integrating local and global scene information.
Findings
Significant improvement over prior methods on benchmark datasets.
Enhanced localization accuracy and caption quality.
Effective balancing of local and global contextual information.
Abstract
3D dense captioning is a task involving the localization of objects and the generation of descriptions for each object in a 3D scene. Recent approaches have attempted to incorporate contextual information by modeling relationships with object pairs or aggregating the nearest neighbor features of an object. However, the contextual information constructed in these scenarios is limited in two aspects. First, objects have multiple positional relationships that exist across the entire global scene, not only near the object itself. Second, it faces contradicting objectives: localization and attribute descriptions are generated better with tight localization, while descriptions involving global positional relations are generated better with contextualized features of the global scene. To overcome this challenge, we introduce BiCA, a transformer encoder-decoder pipeline that engages…
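The abstract is cut off before it describes how BiCA is built, so the following is only a generic illustration of what "bi-directional" attention between two feature sets can look like, not the authors' architecture. It assumes two hypothetical sets of vectors, object_feats (tightly localized per-object features) and context_feats (scene-level features), and lets each set attend to the other with plain scaled dot-product attention. All function and variable names are invented for this sketch, and learned projections, multiple heads, and normalization layers are deliberately omitted.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attend(queries, keys_values):
    """Scaled dot-product attention without learned projections.

    Each row of `queries` attends over the rows of `keys_values`,
    which serve as both keys and values in this simplified sketch.
    """
    dim = len(queries[0])
    outputs = []
    for q in queries:
        scores = [
            sum(a * b for a, b in zip(q, k)) / math.sqrt(dim)
            for k in keys_values
        ]
        weights = softmax(scores)
        outputs.append([
            sum(w * kv[j] for w, kv in zip(weights, keys_values))
            for j in range(dim)
        ])
    return outputs


def bidirectional_step(object_feats, context_feats):
    """One hypothetical two-way exchange (an assumption, not BiCA itself).

    Objects gather scene context, and context gathers object evidence.
    Both updates are computed from the original inputs and added back
    as residuals.
    """
    obj_update = cross_attend(object_feats, context_feats)
    ctx_update = cross_attend(context_feats, object_feats)
    new_objects = [
        [a + b for a, b in zip(row, upd)]
        for row, upd in zip(object_feats, obj_update)
    ]
    new_context = [
        [a + b for a, b in zip(row, upd)]
        for row, upd in zip(context_feats, ctx_update)
    ]
    return new_objects, new_context


if __name__ == "__main__":
    # Toy data: three object vectors and two context vectors, dimension 2.
    objects = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
    context = [[0.5, 0.5], [2.0, -1.0]]
    new_objects, new_context = bidirectional_step(objects, context)
    print(len(new_objects), len(new_context))
```

Keeping the two feature sets separate and exchanging information in both directions is one way to let per-object features stay tight while scene-level features stay global, which is the tension the abstract describes. How BiCA actually resolves it should be taken from the paper itself.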
Taxonomy
Topics: Advanced Vision and Imaging · Multimodal Machine Learning Applications · Human Pose and Action Recognition
Methods: Softmax · Attention Is All You Need
