TL;DR
This paper reveals that multimodal large language models (MLLMs) can identify relevant video segments during prefill but tend to shift attention away during answer generation, and proposes a read-then-regenerate method to improve temporal grounding accuracy.
Contribution
It uncovers the attention shift phenomenon in MLLMs during video temporal grounding and introduces a simple inference-time framework to enhance their temporal localization without retraining.
Findings
A sparse set of attention heads (TG-Heads) concentrates query-to-video attention on the ground-truth event interval during prefill.
Re-invoking MLLMs with focused visual context improves grounding accuracy.
The framework improves performance on three VTG benchmarks by up to 3.5 mIoU.
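For readers unfamiliar with the metric, the following is a minimal sketch of how temporal IoU and mIoU are commonly computed for VTG. It is not taken from the paper: the function names are hypothetical, intervals are assumed to be (start, end) pairs in seconds, and mIoU is assumed to be reported in percentage points, so that a gain of 3.5 mIoU would mean 3.5 points on a 0 to 100 scale.

```python
from typing import Sequence, Tuple

Interval = Tuple[float, float]  # (start_sec, end_sec), assumed start <= end


def temporal_iou(pred: Interval, gt: Interval) -> float:
    """Intersection-over-union of two 1-D time intervals, in [0, 1]."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0


def mean_iou(preds: Sequence[Interval], gts: Sequence[Interval]) -> float:
    """Mean temporal IoU over a dataset, scaled to percentage points (assumption)."""
    if len(preds) != len(gts):
        raise ValueError("preds and gts must have the same length")
    if not preds:
        return 0.0
    total = sum(temporal_iou(p, g) for p, g in zip(preds, gts))
    return 100.0 * total / len(preds)


if __name__ == "__main__":
    # Illustrative values only, not results from the paper.
    print(temporal_iou((0.0, 10.0), (5.0, 15.0)))
    print(mean_iou([(0.0, 10.0), (0.0, 10.0)], [(0.0, 10.0), (20.0, 30.0)]))
```

Under this definition, a prediction that overlaps half of the ground-truth interval and extends equally far outside it scores one third, which is why small boundary errors on short events cost a lot of IoU.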
Abstract
Video temporal grounding (VTG), which localizes the start and end times of a queried event in an untrimmed video, is a key test of whether multimodal large language models (MLLMs) understand not only what happens but also when it happens. Although modern MLLMs describe video content fluently, their timestamp predictions remain unreliable, while existing remedies either require costly post-training on temporal annotations or rely on coarse training-free heuristics. In this work, we probe the cross-modal attention of MLLMs and uncover a perception-generation gap. Our key finding is that MLLMs often know the target interval during prefill, but lose this signal when generating the final answer. In the prefill stage, a sparse set of attention heads, which we call Temporal Grounding Heads (TG-Heads), concentrates query-to-video attention on the ground-truth interval. During…
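The prefill-stage observation can be illustrated with a toy sketch that turns per-frame attention mass into a candidate interval. This is an assumption-laden illustration, not the paper's procedure: the function name attention_to_interval, the relative-threshold rule, and the ratio parameter are all hypothetical, and the per-frame scores are assumed to have already been aggregated from query-to-video attention of some chosen heads.

```python
from typing import Sequence, Tuple


def attention_to_interval(
    frame_scores: Sequence[float],
    timestamps: Sequence[float],
    ratio: float = 0.5,
) -> Tuple[float, float]:
    """Grow a contiguous window around the peak-attention frame.

    A frame joins the window while its score is at least `ratio` times the
    peak score. The thresholding rule is a hypothetical heuristic chosen
    for illustration only.
    """
    if not frame_scores or len(frame_scores) != len(timestamps):
        raise ValueError("frame_scores and timestamps must be non-empty and aligned")

    # Index of the frame that receives the most attention.
    peak = max(range(len(frame_scores)), key=lambda i: frame_scores[i])
    threshold = ratio * frame_scores[peak]

    # Expand left and right while neighbouring frames stay above the threshold.
    lo = peak
    while lo > 0 and frame_scores[lo - 1] >= threshold:
        lo -= 1
    hi = peak
    while hi < len(frame_scores) - 1 and frame_scores[hi + 1] >= threshold:
        hi += 1

    return timestamps[lo], timestamps[hi]


if __name__ == "__main__":
    # Made-up scores for seven frames sampled every 2 seconds.
    scores = [0.01, 0.02, 0.25, 0.40, 0.30, 0.05, 0.02]
    times = [0.0, 2.0, 4.0, 6.0, 8.0, 10.0, 12.0]
    print(attention_to_interval(scores, times))
```

A window obtained this way could then serve as the focused visual context for a second model call, in the spirit of the read-then-regenerate idea described in the TL;DR, though how the paper actually selects and re-feeds frames is not specified in this summary.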
