MLLMs Know When Before Speaking: Revealing and Recovering Temporal Grounding via Attention Cues

Dazhao Du; Liao Duan; Jian Liu; Tao Han; Yujia Zhang; Eric Liu; Xi Chen; Song Guo

arXiv:2605.21954·cs.CV·May 22, 2026

TL;DR

This paper reveals that multimodal large language models (MLLMs) can identify relevant video segments during prefill but tend to shift attention away during answer generation, and proposes a read-then-regenerate method to improve temporal grounding accuracy.

Contribution

It uncovers the attention shift phenomenon in MLLMs during video temporal grounding and introduces a simple inference-time framework to enhance their temporal localization without retraining.

Findings

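01. During prefill, a sparse set of attention heads (Temporal Grounding Heads, or TG-Heads) concentrates query-to-video attention on the ground-truth event interval.

02. Re-invoking the MLLM with a focused visual context improves grounding accuracy.

03. The framework improves performance on three VTG benchmarks by up to 3.5 mIoU.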
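For readers unfamiliar with the metric, mIoU in temporal grounding is commonly the mean, over all queries, of the intersection-over-union between the predicted and ground-truth time intervals. The sketch below illustrates that standard definition; the function names and the percentage scaling are assumptions made for illustration and are not taken from the paper's evaluation code.

```python
from typing import Sequence, Tuple

Interval = Tuple[float, float]  # (start_seconds, end_seconds)


def temporal_iou(pred: Interval, gt: Interval) -> float:
    """Intersection-over-union of two time intervals, in [0, 1]."""
    # Overlap is zero when the intervals are disjoint.
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0


def mean_iou(preds: Sequence[Interval], gts: Sequence[Interval]) -> float:
    """Mean temporal IoU over a set of queries, scaled to 0-100 (assumed convention)."""
    if len(preds) != len(gts):
        raise ValueError("preds and gts must have the same length")
    if not preds:
        return 0.0
    total = sum(temporal_iou(p, g) for p, g in zip(preds, gts))
    return 100.0 * total / len(preds)
```

Under this convention, a gain of 3.5 mIoU means the average overlap between predicted and true intervals rises by 3.5 percentage points.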

Abstract

Video temporal grounding (VTG), which localizes the start and end times of a queried event in an untrimmed video, is a key test of whether multimodal large language models (MLLMs) understand not only what happens but also when it happens. Although modern MLLMs describe video content fluently, their timestamp predictions remain unreliable, while existing remedies either require costly post-training on temporal annotations or rely on coarse training-free heuristics. In this work, we probe the cross-modal attention of MLLMs and uncover a perception-generation gap. Our key finding is that MLLMs often know the target interval during prefill, but lose this signal when generating the final answer. In the prefill stage, a sparse set of attention heads, which we call Temporal Grounding Heads (TG-Heads), concentrates query-to-video attention on the ground-truth interval. During…
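To make the idea of an attention cue concrete, here is a minimal sketch of how per-frame query-to-video attention from a few heads might be turned into a candidate time interval. It is a hypothetical illustration only: the abstract above is truncated and does not specify the procedure, so the head averaging, the relative threshold, and the names average_heads and attention_to_interval are all assumptions rather than the authors' method.

```python
from typing import List, Sequence, Tuple


def average_heads(head_scores: Sequence[Sequence[float]]) -> List[float]:
    """Average per-frame attention over a set of heads (assumed aggregation)."""
    num_heads = len(head_scores)
    num_frames = len(head_scores[0])
    return [sum(h[i] for h in head_scores) / num_heads for i in range(num_frames)]


def attention_to_interval(
    frame_scores: Sequence[float],
    frame_times: Sequence[float],
    rel_threshold: float = 0.5,  # hypothetical cutoff, as a fraction of the peak
) -> Tuple[float, float]:
    """Grow a contiguous window around the peak frame while scores stay above the cutoff."""
    if len(frame_scores) != len(frame_times) or not frame_scores:
        raise ValueError("frame_scores and frame_times must be non-empty and aligned")
    peak = max(range(len(frame_scores)), key=lambda i: frame_scores[i])
    cutoff = rel_threshold * frame_scores[peak]
    lo = hi = peak
    while lo > 0 and frame_scores[lo - 1] >= cutoff:
        lo -= 1
    while hi < len(frame_scores) - 1 and frame_scores[hi + 1] >= cutoff:
        hi += 1
    # Return the timestamps of the first and last frames in the window.
    return frame_times[lo], frame_times[hi]
```

A window of this kind could then serve as the focused visual context when the model is invoked a second time, though how the paper selects heads and crops the video is not stated in the text available here.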

Code & Models

Repositories

https://ddz16.github.io/mllmsknowwhen.github.io
