See It All: Contextualized Late Aggregation for 3D Dense Captioning

Minjung Kim; Hyung Suk Lim; Seung Hwan Kim; Soonyoung Lee; Bumsoo Kim; Gunhee Kim

arXiv:2408.07648 · cs.CV · August 15, 2024

TL;DR

This paper introduces SIA, a transformer-based approach for 3D dense captioning that uses late aggregation of context and object queries to improve caption quality and localization accuracy.

Contribution

The paper proposes a novel late aggregation paradigm with separate context and instance queries, enhancing 3D dense captioning performance over prior methods.

Findings

01. Significant improvement over previous methods on benchmark datasets.

02. Effective use of separate queries for context and objects.

03. Enhanced caption quality through a new aggregator.

Abstract

3D dense captioning is the task of localizing objects in a 3D scene and generating a descriptive sentence for each object. Recent approaches to 3D dense captioning have adopted transformer encoder-decoder frameworks from object detection to build an end-to-end pipeline without hand-crafted components. However, these approaches struggle with conflicting objectives: a single query's attention has to cover both the tightly localized object region and the contextual environment at the same time. To overcome this challenge, we introduce SIA (See-It-All), a transformer pipeline that performs 3D dense captioning with a novel paradigm called late aggregation. SIA simultaneously decodes two sets of queries: context queries and instance queries. The instance query focuses on localization and object attribute descriptions, while the context query versatilely captures the region-of-interest of relationships…
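The abstract is truncated here and does not say how the two query sets are combined. As a rough illustration of the general idea only, the sketch below treats late aggregation as "decode instance and context queries separately, then fuse them afterwards". The fusion rule (scaled dot-product attention from each instance query over the context queries, followed by concatenation) and the names `late_aggregate`, `instance_queries`, and `context_queries` are assumptions made for this example; they are not the paper's aggregator.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def late_aggregate(instance_queries, context_queries):
    """Hypothetical late fusion (an assumption, not the paper's aggregator).

    Each already-decoded instance query attends over the already-decoded
    context queries; the attention-pooled context vector is concatenated
    to the instance query to form the feature a caption head would use.
    """
    d = len(instance_queries[0])
    fused = []
    for inst in instance_queries:
        # Scaled dot-product scores between this instance and every context query.
        scores = [dot(inst, ctx) / math.sqrt(d) for ctx in context_queries]
        weights = softmax(scores)
        # Weighted sum of context queries, one value per feature dimension.
        pooled = [
            sum(w * ctx[i] for w, ctx in zip(weights, context_queries))
            for i in range(d)
        ]
        fused.append(list(inst) + pooled)
    return fused


# Toy inputs: 2 instance queries and 3 context queries, each of dimension 2.
instance_queries = [[1.0, 0.0], [0.0, 1.0]]
context_queries = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
fused = late_aggregate(instance_queries, context_queries)
```

In this toy setup each fused vector has twice the query dimension: the first half is the instance query itself (localization and attributes), and the second half is the context most similar to it. A real model would use learned projections and operate on batched tensors.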

Taxonomy

Topics: Multimodal Machine Learning Applications · Video Analysis and Summarization · Advanced Vision and Imaging

Methods: Softmax · Attention Is All You Need