See It All: Contextualized Late Aggregation for 3D Dense Captioning
Minjung Kim, Hyung Suk Lim, Seung Hwan Kim, Soonyoung Lee, Bumsoo Kim, Gunhee Kim

TL;DR
This paper introduces SIA, a transformer-based approach for 3D dense captioning that decodes context queries and instance queries separately and aggregates them late to improve caption quality and localization accuracy.
Contribution
The paper proposes a novel late aggregation paradigm with separate context and instance queries, enhancing 3D dense captioning performance over prior methods.
Findings
Significant improvement over previous methods on benchmark datasets.
Effective use of separate queries for context and objects.
Enhanced caption quality through a new aggregator.
Abstract
3D dense captioning is the task of localizing objects in a 3D scene and generating a descriptive sentence for each object. Recent approaches have adopted transformer encoder-decoder frameworks from object detection to build end-to-end pipelines without hand-crafted components. However, these approaches struggle with conflicting objectives: a single query's attention has to simultaneously view both the tightly localized object regions and the contextual environment. To overcome this challenge, we introduce SIA (See-It-All), a transformer pipeline that engages in 3D dense captioning with a novel paradigm called late aggregation. SIA simultaneously decodes two sets of queries: context queries and instance queries. The instance query focuses on localization and object attribute descriptions, while the context query versatilely captures the region-of-interest of relationships…
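The late aggregation idea in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the abstract does not specify the aggregator, so the function names (`decode`, `late_aggregate`), the single-head attention, and the choice to concatenate each instance feature with its attended context are all assumptions made for illustration only.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(query, keys, values):
    """Scaled dot-product attention for a single query vector."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    weights = softmax(scores)
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]

def decode(queries, scene_feats):
    """Stand-in decoder (assumption): each query attends over scene features."""
    return [attend(q, scene_feats, scene_feats) for q in queries]

def late_aggregate(instance_feats, context_feats):
    """Hypothetical aggregator: each decoded instance feature gathers context
    by attention, and the two are concatenated only after decoding."""
    return [inst + attend(inst, context_feats, context_feats)
            for inst in instance_feats]

# Toy encoded scene: three 2-D feature vectors (made-up values).
scene = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
instance_queries = [[1.0, 0.0]]              # drives localization and attributes
context_queries = [[0.0, 1.0], [1.0, 1.0]]   # drives relationships and surroundings

# The two query sets are decoded independently, then merged for captioning.
instance_feats = decode(instance_queries, scene)
context_feats = decode(context_queries, scene)
caption_inputs = late_aggregate(instance_feats, context_feats)
```

The point of the sketch is the ordering: the instance and context queries never share attention during decoding, so neither has to trade off tight localization against wide context, and the combination happens only at the final step that feeds the caption head.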
Taxonomy
Topics: Multimodal Machine Learning Applications · Video Analysis and Summarization · Advanced Vision and Imaging
Methods: Softmax · Attention Is All You Need
