When Thinking Hurts: Mitigating Visual Forgetting in Video Reasoning via Frame Repetition
Xiaokun Sun, Yubo Wang, Haoyu Cao, Linli Xu

TL;DR
This paper introduces FrameRepeat, a framework that improves video reasoning in multimodal models by automatically reinforcing important frames, addressing visual forgetting without extensive retraining.
Contribution
The paper proposes a novel, generalizable method that uses a lightweight frame scoring network and an Add-One-In training strategy to enhance visual input retention in Video-LLMs.
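As a rough illustration of the frame repetition idea, the sketch below duplicates the highest-scoring frames in a frame sequence before it would be passed to a Video-LLM. It is not the paper's implementation: the function name `repeat_top_frames`, the parameter `k`, the hand-written scores, and the choice to place each repeated frame directly after its original are all assumptions made for illustration. In the paper, the scores would come from the learned frame scoring network, which is not modeled here.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def repeat_top_frames(frames: Sequence[T], scores: Sequence[float], k: int) -> List[T]:
    """Duplicate the k highest-scoring frames, keeping temporal order.

    Hypothetical helper: the placement of repeated frames is an assumption,
    not something stated in the source.
    """
    if len(frames) != len(scores):
        raise ValueError("frames and scores must have the same length")
    k = max(0, min(k, len(frames)))
    # Indices of the k largest scores; ties go to the earlier frame.
    ranked = sorted(range(len(frames)), key=lambda i: (-scores[i], i))
    top = set(ranked[:k])
    out: List[T] = []
    for i, frame in enumerate(frames):
        out.append(frame)
        if i in top:
            out.append(frame)  # assumed placement: right after the original
    return out


# Placeholder frames and scores; a real pipeline would use image tensors
# and scores predicted by a scoring module.
frames = ["f0", "f1", "f2", "f3"]
scores = [0.1, 0.9, 0.3, 0.7]
augmented = repeat_top_frames(frames, scores, k=2)
print(augmented)  # ['f0', 'f1', 'f1', 'f2', 'f3', 'f3']
```

Keeping the repeated frames adjacent to their originals preserves temporal order, which is one simple design choice; other placements are equally plausible given what the summary states.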
Findings
Effective across multiple models and datasets
Reduces hallucinations and improves reasoning accuracy
Automates frame reinforcement without heavy retraining
Abstract
Recently, Multimodal Large Language Models (MLLMs) have demonstrated significant potential in complex visual tasks through the integration of Chain-of-Thought (CoT) reasoning. However, in Video Question Answering, extended thinking processes do not consistently yield performance gains and may even lead to degradation due to "visual anchor drifting", where models increasingly rely on self-generated text, sidelining visual inputs and causing hallucinations. While existing mitigations typically introduce specific mechanisms for the model to re-attend to visual inputs during inference, these approaches often incur prohibitive training costs and suffer from poor generalizability across different architectures. To address this, we propose FrameRepeat, an automated enhancement framework that features a lightweight repeat scoring module that enables Video-LLMs to autonomously identify which…
Taxonomy
Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Explainable Artificial Intelligence (XAI)
