Implicit Location-Caption Alignment via Complementary Masking for Weakly-Supervised Dense Video Captioning

Shiping Ge; Qiang Chen; Zhiwei Jiang; Yafeng Yin; Liu Qin; Ziyao Chen; Qing Gu

arXiv:2412.12791·cs.CV·January 28, 2025

TL;DR

This paper introduces a novel weakly-supervised dense video captioning method using complementary masking to implicitly align event locations with captions, simplifying localization without explicit annotations.

Contribution

The proposed approach uses a dual-mode video captioning module and a mask generation module to implicitly align event locations with captions, avoiding the complex event proposal procedures that existing explicit-alignment methods require.

Findings

01. Outperforms existing weakly-supervised methods

02. Achieves results competitive with fully-supervised approaches

03. Effective implicit alignment of event locations and captions

Abstract

Weakly-Supervised Dense Video Captioning (WSDVC) aims to localize and describe all events of interest in a video without requiring annotations of event boundaries. This setting makes it very challenging to localize events accurately in time, because the relevant supervision is unavailable. Existing methods rely on explicit alignment constraints between event locations and captions, which involve complex event proposal procedures during both training and inference. To tackle this problem, we propose a novel implicit location-caption alignment paradigm based on complementary masking, which simplifies the complex event proposal and localization process while maintaining effectiveness. Specifically, our model comprises two components: a dual-mode video captioning module and a mask generation module. The dual-mode video captioning module captures global event information and generates…
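
The abstract is truncated before it describes how the masks are built, so the following is only a minimal sketch of the general idea of complementary masking over video frames, not the authors' implementation. It assumes a Gaussian-shaped soft mask parameterized by a hypothetical `center` and `width`; the function names `soft_mask`, `complementary_pair`, and `apply_mask` are illustrative and do not come from the paper or its repository.

```python
import math


def soft_mask(num_frames, center, width):
    """Gaussian-shaped soft mask in [0, 1] over frame positions.

    `center` and `width` are hypothetical parameters on a normalized
    [0, 1] timeline; this parameterization is an assumption.
    """
    mask = []
    for t in range(num_frames):
        pos = (t + 0.5) / num_frames  # normalized midpoint of frame t
        mask.append(math.exp(-((pos - center) ** 2) / (2.0 * width ** 2)))
    return mask


def complementary_pair(mask):
    """Return the mask and its complement; the two sum to 1 at every frame."""
    return mask, [1.0 - m for m in mask]


def apply_mask(features, mask):
    """Scale each frame's feature vector by that frame's mask weight."""
    return [[m * x for x in frame] for frame, m in zip(features, mask)]


if __name__ == "__main__":
    features = [[1.0, 2.0] for _ in range(8)]  # 8 frames, 2-d toy features
    inside, outside = complementary_pair(soft_mask(8, center=0.5, width=0.1))
    event_view = apply_mask(features, inside)   # emphasizes the candidate event
    rest_view = apply_mask(features, outside)   # emphasizes everything else
    print([round(m, 3) for m in inside])
```

Because the two masks sum to one at every frame, the masked view and its complement together account for the whole video. That is the property a complementary scheme relies on when no event-boundary annotations are available.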

Code & Models

Repositories

ShipingGe/ILCACM (PyTorch, official)

Videos

Implicit Location-Caption Alignment via Complementary Masking for Weakly-Supervised Dense Video Captioning · underline

Taxonomy

Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Video Analysis and Summarization