Don't Blink: Evidence Collapse during Multimodal Reasoning
Suresh Raghu, Satwik Pandey

TL;DR
This paper investigates how reasoning vision-language models often lose visual grounding during reasoning, leading to overconfident but ungrounded predictions, and proposes task-aware monitoring strategies to improve safety.
Contribution
It uncovers the evidence-collapse phenomenon in reasoning VLMs, analyzes the limitations of current uncertainty signals, and introduces a task-conditional vision veto to enhance safe deployment.
Findings
Attention to annotated evidence regions drops substantially as reasoning unfolds, often by more than half of the evidence mass.
Full-response entropy is the most reliable text-only uncertainty signal under cross-dataset transfer.
A vision veto reduces risk by up to 1.9 percentage points at 90% coverage.
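To make the "risk at 90% coverage" figure concrete, here is a minimal sketch of how selective risk at a fixed coverage is commonly computed. It assumes the standard selective-prediction definition (answer only the most confident fraction of examples, then measure the error rate among those answered); the function name `risk_at_coverage` and its arguments are illustrative and are not taken from the paper.

```python
def risk_at_coverage(uncertainty, correct, coverage=0.9):
    """Error rate among the `coverage` fraction of examples with lowest uncertainty.

    uncertainty: list of floats, higher means less confident (e.g. an entropy score).
    correct:     list of bools, True where the model's answer was right.
    coverage:    fraction of examples the system is allowed to answer.
    """
    if len(uncertainty) != len(correct) or not uncertainty:
        raise ValueError("inputs must be non-empty and of equal length")
    # Number of examples the system answers; the rest are abstained on.
    n_keep = max(1, round(coverage * len(uncertainty)))
    # Rank examples from most to least confident (sorted() is stable on ties).
    order = sorted(range(len(uncertainty)), key=lambda i: uncertainty[i])
    kept = order[:n_keep]
    errors = sum(1 for i in kept if not correct[i])
    return errors / n_keep


# Toy data (made up for illustration): ten examples, two errors.
scores = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]
labels = [True, True, True, False, True, True, True, True, True, False]
risk_90 = risk_at_coverage(scores, labels, coverage=0.9)
risk_100 = risk_at_coverage(scores, labels, coverage=1.0)
```

Under this definition, a reduction of 1.9 percentage points would mean that the error rate among the answered 90% is 0.019 lower with the veto than without it.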
Abstract
Reasoning VLMs can become more accurate while progressively losing visual grounding as they think. This creates task-conditional danger zones where low-entropy predictions are confident but ungrounded, a failure mode text-only monitoring cannot detect. Evaluating three reasoning VLMs on MathVista, HallusionBench, and MMMU_Pro, we find a pervasive evidence-collapse phenomenon: attention to annotated evidence regions drops substantially, often losing over half of evidence mass, as reasoning unfolds. Full-response entropy is the most reliable text-only uncertainty signal under cross-dataset transfer, yet adding vision features with a single global linear rule is brittle and often degrades transfer. An entropy-vision interaction model reveals a task-conditional regime: low-entropy, visually disengaged predictions are hazardous on sustained visual-reference tasks but benign on symbolic tasks.…
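The quantities named in the abstract can be illustrated with a small sketch. Everything here is an assumption made for illustration: full-response entropy is taken as the mean per-token Shannon entropy over the generated response, evidence mass as the share of image-token attention that falls inside the annotated evidence region, and the veto as a simple threshold rule applied only to visual-reference tasks. The thresholds `entropy_max` and `mass_min` are hypothetical placeholders, and the paper's actual definitions and fitted model may differ.

```python
import math


def token_entropy(probs):
    """Shannon entropy (in nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def full_response_entropy(step_probs):
    """Mean token entropy over all generated tokens (assumed definition)."""
    return sum(token_entropy(p) for p in step_probs) / len(step_probs)


def evidence_mass(image_attention, evidence_mask):
    """Fraction of image attention that lands on annotated evidence tokens."""
    total = sum(image_attention)
    if total == 0:
        return 0.0
    inside = sum(a for a, m in zip(image_attention, evidence_mask) if m)
    return inside / total


def vision_veto(entropy, mass, visual_task, entropy_max=0.5, mass_min=0.2):
    """Flag confident-but-ungrounded predictions, only on visual-reference tasks.

    entropy_max and mass_min are hypothetical thresholds, not values from the paper.
    """
    return visual_task and entropy < entropy_max and mass < mass_min


# Toy inputs (made up): two generated tokens over a four-word vocabulary,
# and attention over four image tokens, two of which are annotated evidence.
h = full_response_entropy([[0.25, 0.25, 0.25, 0.25], [1.0, 0.0, 0.0, 0.0]])
m = evidence_mass([0.1, 0.3, 0.4, 0.2], [0, 1, 1, 0])
flag_visual = vision_veto(entropy=0.1, mass=0.05, visual_task=True)
flag_symbolic = vision_veto(entropy=0.1, mass=0.05, visual_task=False)
```

The `visual_task` switch mirrors the task-conditional framing: the same low-entropy, low-evidence-mass prediction is flagged on a sustained visual-reference task and left alone on a symbolic one.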
