Exploring Temporal Event Cues for Dense Video Captioning in Cyclic Co-learning

Zhuyang Xie; Yan Yang; Yankai Yu; Jie Wang; Yongquan Jiang; Xiao Wu

arXiv:2412.11467·cs.CV·December 17, 2024

Open Access

TL;DR

This paper introduces Multi-Concept Cyclic Learning (MCCL), a dense video captioning model that uses frame-level concept detection and cyclic co-learning to improve event localization and description in untrimmed videos, achieving state-of-the-art results on ActivityNet Captions.

Contribution

The paper proposes a cyclic co-learning framework that integrates weakly supervised, frame-level concept detection with the captioning network to improve dense video captioning.

Findings

01

Achieves state-of-the-art performance on ActivityNet Captions.

02

Demonstrates effectiveness of cyclic co-learning in video captioning.

03

Improves semantic perception and event localization accuracy.

Abstract

Dense video captioning aims to detect and describe all events in untrimmed videos. This paper presents a dense video captioning network called Multi-Concept Cyclic Learning (MCCL), which aims to: (1) detect multiple concepts at the frame level, using these concepts to enhance video features and provide temporal event cues; and (2) design cyclic co-learning between the generator and the localizer within the captioning network to promote semantic perception and event localization. Specifically, we perform weakly supervised concept detection for each frame, and the detected concept embeddings are integrated into the video features to provide event cues. Additionally, video-level concept contrastive learning is introduced to obtain more discriminative concept embeddings. In the captioning network, we establish a cyclic co-learning strategy where the generator guides the localizer for event…
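The frame-level concept detection step described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the linear detector, the max-pooling over time used to obtain a video-level prediction for weak supervision, the probability-weighted sum used to fold concept embeddings into the frame features, and all names (`detect_concepts`, `video_level_bce`, `enhance_features`) are assumptions made purely for illustration.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def detect_concepts(frame_feats, w, b):
    """Per-frame multi-label concept probabilities, shape (T, K).
    A single linear layer is an assumption, not the paper's detector."""
    return sigmoid(frame_feats @ w + b)

def video_level_bce(frame_probs, video_labels, eps=1e-7):
    """Weak supervision (assumed form): only video-level concept labels exist,
    so frame probabilities are max-pooled over time before the BCE loss."""
    video_probs = np.clip(frame_probs.max(axis=0), eps, 1.0 - eps)
    bce = video_labels * np.log(video_probs) + (1.0 - video_labels) * np.log(1.0 - video_probs)
    return float(-np.mean(bce))

def enhance_features(frame_feats, frame_probs, concept_emb):
    """Add probability-weighted concept embeddings to each frame feature
    (one assumed way of integrating concepts as temporal event cues)."""
    return frame_feats + frame_probs @ concept_emb

# Toy data: T frames, D-dimensional features, K concepts (all hypothetical).
rng = np.random.default_rng(0)
T, D, K = 8, 16, 5
frame_feats = rng.normal(size=(T, D))
w = rng.normal(scale=0.1, size=(D, K))
b = np.zeros(K)
concept_emb = rng.normal(size=(K, D))
video_labels = np.array([1.0, 0.0, 1.0, 0.0, 0.0])  # concepts present somewhere in the video

frame_probs = detect_concepts(frame_feats, w, b)
loss = video_level_bce(frame_probs, video_labels)
enhanced = enhance_features(frame_feats, frame_probs, concept_emb)
```

In this sketch the per-frame probabilities in `frame_probs` act as the temporal cue: a concept that is active only in some frames contributes its embedding only to those frames of `enhanced`.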

Taxonomy

Topics: Advanced Vision and Imaging · Subtitles and Audiovisual Media · Multimodal Machine Learning Applications

Methods: Contrastive Learning