TA-Prompting: Enhancing Video Large Language Models for Dense Video Captioning via Temporal Anchors

Wei-Yuan Cheng, Kai-Po Chang, Chi-Pin Huang, Fu-En Yang, and Yu-Chiang Frank Wang

arXiv:2601.02908 · cs.CV · January 7, 2026

TL;DR

TA-Prompting introduces Temporal Anchors that help VideoLLMs localize and describe events in untrimmed videos, improving caption grounding and coherence in dense video captioning and temporal understanding.

Contribution

The paper proposes TA-Prompting with Temporal Anchors to precisely localize events and improve VideoLLMs' performance on dense captioning and temporal tasks.

Findings

01

Outperforms state-of-the-art methods on benchmark datasets.

02

Improves event boundary detection and caption grounding.

03

Enhances temporal understanding in dense video captioning.

Abstract

Dense video captioning aims to interpret and describe all temporally localized events throughout an input video. Recent state-of-the-art methods leverage large language models (LLMs) to provide detailed moment descriptions for video data. However, existing VideoLLMs still struggle to identify precise event boundaries in untrimmed videos, so the generated captions are not properly grounded. In this paper, we propose TA-Prompting, which enhances VideoLLMs via Temporal Anchors that learn to precisely localize events and prompt the VideoLLMs to perform temporal-aware video event understanding. During inference, to determine the output caption sequence from the arbitrary number of events present in a video, we introduce an event coherent sampling strategy that selects event captions with sufficient coherence across temporal events and cross-modal similarity…
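
The selection step described at the end of the abstract can be pictured with a small sketch. The abstract is truncated and does not define the coherence measure or the similarity score, so this is not the authors' sampling strategy. As a stand-in, it uses a greedy temporal non-maximum suppression: candidates are ranked by a precomputed caption-to-clip similarity, and a candidate is kept only if it does not overlap too heavily with events already kept. The names Candidate, temporal_iou, select_events, and both thresholds are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    """One candidate event; all fields are illustrative assumptions."""
    start: float       # event start time in seconds
    end: float         # event end time in seconds
    caption: str       # caption generated for this segment
    similarity: float  # assumed precomputed caption-to-clip similarity


def temporal_iou(a: Candidate, b: Candidate) -> float:
    """Intersection-over-union of two time intervals."""
    inter = max(0.0, min(a.end, b.end) - max(a.start, b.start))
    union = (a.end - a.start) + (b.end - b.start) - inter
    return inter / union if union > 0 else 0.0


def select_events(candidates, min_similarity=0.5, max_iou=0.5):
    """Greedy stand-in for caption selection (not the paper's method).

    Keeps high-similarity candidates that do not overlap heavily with
    already-kept events, then returns them in temporal order.
    """
    kept = []
    ranked = sorted(candidates, key=lambda c: c.similarity, reverse=True)
    for cand in ranked:
        if cand.similarity < min_similarity:
            break  # remaining candidates score even lower
        if all(temporal_iou(cand, k) <= max_iou for k in kept):
            kept.append(cand)
    return sorted(kept, key=lambda c: c.start)
```

Because the number of kept events depends only on the thresholds, the sketch handles videos with any number of events, which is the property the abstract emphasizes. A faithful version would replace the overlap test with whatever coherence criterion the full paper defines.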

Taxonomy

Topics: Multimodal Machine Learning Applications · Human Pose and Action Recognition · Generative Adversarial Networks and Image Synthesis