Stay in your Lane: Role Specific Queries with Overlap Suppression Loss for Dense Video Captioning
Seung Hyup Baek, Jimin Lee, Hyeongkeun Lee, and Jae Won Cho

TL;DR
This paper introduces role-specific queries and an overlap suppression loss for dense video captioning, improving event localization and caption coherence by assigning localization and captioning to separate queries and reducing temporal redundancy.
Contribution
It proposes a novel framework with role-specific queries, contrastive alignment, and overlap suppression loss to enhance dense video captioning performance.
Findings
Improved localization accuracy on YouCook2 and ActivityNet Captions.
Enhanced semantic richness in generated captions.
Reduced temporal redundancy and multi-task interference.
Abstract
Dense Video Captioning (DVC) is a challenging multimodal task that involves temporally localizing multiple events within a video and describing them with natural language. While query-based frameworks enable the simultaneous, end-to-end processing of localization and captioning, their reliance on shared queries often leads to significant multi-task interference between the two tasks, as well as temporal redundancy in localization. In this paper, we propose utilizing role-specific queries that separate localization and captioning into independent components, allowing each to exclusively learn its role. We then employ contrastive alignment to enforce semantic consistency between the corresponding outputs, ensuring coherent behavior across the separated queries. Furthermore, we design a novel suppression mechanism in which mutual temporal overlaps across queries are penalized to tackle…
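The abstract is truncated before it defines the suppression mechanism, so the following is only a minimal sketch of what penalizing mutual temporal overlaps across queries could look like. It assumes each query predicts a segment as a `(start, end)` pair and uses mean pairwise temporal IoU as the penalty. Both the IoU choice and the names `temporal_iou` and `overlap_penalty` are illustrative assumptions, not the authors' formulation.

```python
from itertools import combinations


def temporal_iou(a, b):
    """Intersection-over-union of two 1-D segments given as (start, end)."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0


def overlap_penalty(segments):
    """Hypothetical penalty: mean pairwise IoU over all predicted segments.

    Returns 0.0 when no two segments overlap and grows toward 1.0 as the
    predictions collapse onto the same time span.
    """
    pairs = list(combinations(segments, 2))
    if not pairs:
        return 0.0
    return sum(temporal_iou(a, b) for a, b in pairs) / len(pairs)


if __name__ == "__main__":
    disjoint = [(0.0, 10.0), (12.0, 20.0), (25.0, 30.0)]
    redundant = [(0.0, 10.0), (1.0, 11.0), (0.5, 9.5)]
    print(overlap_penalty(disjoint))
    print(overlap_penalty(redundant))
```

In a training setup, a term of this kind would be added to the localization loss in a differentiable form, so that queries predicting near-duplicate segments are pushed apart. The pure-Python version above only illustrates the quantity being penalized.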
Taxonomy
Topics: Multimodal Machine Learning Applications · Human Pose and Action Recognition · Generative Adversarial Networks and Image Synthesis
