Stay in your Lane: Role Specific Queries with Overlap Suppression Loss for Dense Video Captioning

Seung Hyup Baek, Jimin Lee, Hyeongkeun Lee, and Jae Won Cho

arXiv:2603.11439 · cs.CV · March 13, 2026

Open Access

TL;DR

This paper introduces role-specific queries and an overlap suppression loss for dense video captioning. Separating localization from captioning reduces multi-task interference between the two tasks, and penalizing mutually overlapping predictions reduces temporal redundancy in localization.

Contribution

The paper proposes a framework that combines role-specific queries, contrastive alignment between their outputs, and an overlap suppression loss to improve dense video captioning.

Findings

01 Improved localization accuracy on YouCook2 and ActivityNet Captions.

02 Enhanced semantic richness in generated captions.

03 Reduced temporal redundancy and multi-task interference.

Abstract

Dense Video Captioning (DVC) is a challenging multimodal task that involves temporally localizing multiple events within a video and describing them with natural language. While query-based frameworks enable the simultaneous, end-to-end processing of localization and captioning, their reliance on shared queries often leads to significant multi-task interference between the two tasks, as well as temporal redundancy in localization. In this paper, we propose utilizing role-specific queries that separate localization and captioning into independent components, allowing each to exclusively learn its role. We then employ contrastive alignment to enforce semantic consistency between the corresponding outputs, ensuring coherent behavior across the separated queries. Furthermore, we design a novel suppression mechanism in which mutual temporal overlaps across queries are penalized to tackle…
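The two loss ideas named in the abstract can be illustrated with a small sketch. The abstract does not give the paper's actual formulations, so everything below is an assumption made for illustration: the overlap term is taken to be the mean pairwise temporal IoU between predicted segments, the alignment term is taken to be a one-directional InfoNCE loss over cosine similarities, and the names `temporal_iou`, `overlap_penalty`, `alignment_loss`, and `temperature` are hypothetical.

```python
import math


def temporal_iou(a, b):
    """IoU of two temporal segments given as (start, end), start <= end."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0


def overlap_penalty(segments):
    """Assumed overlap term: mean temporal IoU over all distinct pairs
    of predicted segments. Zero when no two segments overlap."""
    n = len(segments)
    if n < 2:
        return 0.0
    total = 0.0
    for i in range(n):
        for j in range(i + 1, n):
            total += temporal_iou(segments[i], segments[j])
    return total / (n * (n - 1) / 2)


def _cosine(u, v, eps=1e-8):
    """Cosine similarity with a small floor on the norms."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (max(norm_u, eps) * max(norm_v, eps))


def alignment_loss(loc_feats, cap_feats, temperature=0.1):
    """Assumed alignment term: InfoNCE where the i-th localization
    feature should match the i-th captioning feature."""
    if len(loc_feats) != len(cap_feats):
        raise ValueError("expected one captioning feature per localization feature")
    n = len(loc_feats)
    if n == 0:
        return 0.0
    loss = 0.0
    for i in range(n):
        logits = [_cosine(loc_feats[i], c) / temperature for c in cap_feats]
        m = max(logits)  # subtract the max for a stable log-sum-exp
        log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
        loss += log_denom - logits[i]
    return loss / n


if __name__ == "__main__":
    # Hypothetical predictions: two overlapping segments and one separate one.
    print(overlap_penalty([(0.0, 2.0), (1.0, 3.0), (10.0, 11.0)]))
    # Hypothetical paired features for two events.
    print(alignment_loss([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]]))
```

In a sketch like this, the overlap term pushes predicted segments apart in time, while the alignment term pulls each localization output toward its paired captioning output and away from the others. A trainable version would need differentiable tensor operations rather than plain Python floats.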

Taxonomy

Topics: Multimodal Machine Learning Applications · Human Pose and Action Recognition · Generative Adversarial Networks and Image Synthesis