Relaxing Anchor-Frame Dominance for Mitigating Hallucinations in Video Large Language Models
Zijian Liu, Sihan Cao, Pengcheng Zheng, Kuien Liu, Caiyan Qin, Xiaolin Qin, Jiwei Wei, Chaoning Zhang

TL;DR
This paper introduces a training-free method called DTR that rebalances temporal evidence in Video-LLMs to reduce hallucinations, improving robustness without sacrificing performance.
Contribution
The paper identifies a model-specific temporal bias in Video-LLMs and proposes DTR, a training-free inference-time technique that mitigates hallucinations by rebalancing attention across video frames.
Findings
DTR significantly reduces hallucinations across multiple Video-LLM models.
DTR maintains competitive video understanding performance.
DTR improves inference efficiency and robustness.
Abstract
Recent Video Large Language Models (Video-LLMs) have demonstrated strong capability in video understanding, yet they still suffer from hallucinations. Existing mitigation methods typically rely on training, input modification, auxiliary guidance, or additional decoding procedures, while largely overlooking a more fundamental challenge. During generation, Video-LLMs tend to over-rely on a limited portion of temporal evidence, leading to temporally imbalanced evidence aggregation across the video. To address this issue, we investigate a decoder-side phenomenon in which the model exhibits a temporally imbalanced concentration pattern. We term the frame with the highest aggregated frame-level attention mass the anchor frame. We find that this bias is largely independent of the input video and instead appears to reflect a persistent, model-specific structural or positional bias, whose…
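The anchor-frame definition in the abstract can be illustrated with a minimal sketch. It assumes one already has a single attention weight per visual token (how weights are pooled over heads, layers, and generated tokens is not specified in the excerpt above and is left to the caller) and a mapping from each visual token to its frame index. The function names `frame_attention_mass` and `anchor_frame` are hypothetical and chosen for illustration only; this is not the paper's implementation, and it does not cover DTR itself.

```python
from typing import List, Sequence


def frame_attention_mass(
    attn: Sequence[float], frame_ids: Sequence[int], num_frames: int
) -> List[float]:
    """Sum per-token attention weights by the frame each visual token belongs to.

    Assumption: `attn` holds one already-pooled weight per visual token.
    """
    if len(attn) != len(frame_ids):
        raise ValueError("attn and frame_ids must have the same length")
    mass = [0.0] * num_frames
    for weight, frame in zip(attn, frame_ids):
        mass[frame] += weight
    return mass


def anchor_frame(mass: Sequence[float]) -> int:
    """Return the index of the frame with the highest aggregated attention mass."""
    if not mass:
        raise ValueError("mass must contain at least one frame")
    return max(range(len(mass)), key=lambda i: mass[i])


# Toy example: 3 frames, 2 visual tokens per frame (illustrative numbers only).
attn = [0.1, 0.1, 0.3, 0.2, 0.2, 0.1]
frame_ids = [0, 0, 1, 1, 2, 2]
mass = frame_attention_mass(attn, frame_ids, num_frames=3)
anchor = anchor_frame(mass)
```

In this toy input, frame 1 collects the largest share of attention and is therefore the anchor frame. Running the same computation over many different videos and checking whether the same index keeps winning is one way to probe the input-independence the abstract describes.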
