Explicit Temporal-Semantic Modeling for Dense Video Captioning via Context-Aware Cross-Modal Interaction

Mingda Jia; Weiliang Meng; Zenghuang Fu; Yiheng Li; Qi Zeng; Yifan Zhang; Ju Xin; Rongtao Xu; Jiguang Zhang; Xiaopeng Zhang

arXiv:2511.10134·cs.CV·November 14, 2025

Open Access

TL;DR

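This paper introduces CACMI (Context-Aware Cross-Modal Interaction), an explicit temporal-semantic modeling framework for dense video captioning. It uses cross-modal interaction to capture temporal coherence across events and the semantics of the visual context, and the authors report that it outperforms earlier methods that rely on implicit modeling.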

Contribution

The paper proposes a novel explicit modeling framework, CACMI, that enhances dense video captioning by integrating temporal and semantic information via cross-modal and context-aware mechanisms.

Findings

01. CACMI achieves state-of-the-art results on ActivityNet Captions.

02. CACMI outperforms existing methods in dense video captioning metrics.

03. Extensive experiments validate the effectiveness of explicit temporal-semantic modeling.

Abstract

Dense video captioning jointly localizes and captions salient events in untrimmed videos. Recent methods primarily focus on leveraging additional prior knowledge and advanced multi-task architectures to achieve competitive performance. However, these pipelines rely on implicit modeling that uses frame-level or fragmented video features, failing to capture the temporal coherence across event sequences and the comprehensive semantics within visual contexts. To address this, we propose an explicit temporal-semantic modeling framework called Context-Aware Cross-Modal Interaction (CACMI), which leverages both latent temporal characteristics within videos and linguistic semantics from a text corpus. Specifically, our model consists of two core components: Cross-modal Frame Aggregation aggregates relevant frames to extract temporally coherent, event-aligned textual features through cross-modal…
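The abstract is cut off before it explains how Cross-modal Frame Aggregation works, so the following is only a generic sketch of the idea it names: group neighbouring frames into temporal segments, then look up the text-corpus features that best match each segment. Everything here is an assumption made for illustration, not the authors' method: the function names (`aggregate_frames`, `retrieve_text_features`), the fixed-size window, mean pooling, and cosine-similarity top-k retrieval are all hypothetical stand-ins, and the toy 2-D vectors stand in for real frame and sentence embeddings.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def mean_pool(vectors):
    """Element-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def aggregate_frames(frame_feats, window):
    """Assumed aggregation: mean-pool consecutive frames into segments of `window` frames."""
    return [
        mean_pool(frame_feats[i:i + window])
        for i in range(0, len(frame_feats), window)
    ]


def retrieve_text_features(segment, corpus_feats, top_k):
    """Assumed retrieval: rank corpus entries by cosine similarity to the segment,
    then return the top-k indices and the mean of their feature vectors."""
    ranked = sorted(
        range(len(corpus_feats)),
        key=lambda j: cosine(segment, corpus_feats[j]),
        reverse=True,
    )
    top = ranked[:top_k]
    return top, mean_pool([corpus_feats[j] for j in top])


# Toy data: four frames that drift from one "event" to another,
# and three hypothetical sentence embeddings from a text corpus.
frames = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
corpus = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]

segments = aggregate_frames(frames, window=2)
matches = [retrieve_text_features(seg, corpus, top_k=1) for seg in segments]
```

In this toy setup the first segment is matched to corpus entry 0 and the second to corpus entry 1, so each temporal segment ends up paired with the text feature closest to it. A real system would replace the fixed window with a learned or similarity-based grouping and use embeddings from a pretrained vision-language model; none of those choices can be confirmed from the truncated abstract.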

Taxonomy

Topics: Multimodal Machine Learning Applications · Video Analysis and Summarization · Generative Adversarial Networks and Image Synthesis