Don't Blink: Evidence Collapse during Multimodal Reasoning

Suresh Raghu; Satwik Pandey

arXiv:2604.04207·cs.AI·April 7, 2026


TL;DR

This paper shows that reasoning vision-language models (VLMs) often lose visual grounding as their reasoning unfolds, producing confident but ungrounded predictions, and proposes task-aware monitoring strategies to make deployment safer.

Contribution

It uncovers the evidence-collapse phenomenon in reasoning VLMs, analyzes the limitations of current uncertainty signals, and introduces a task-conditional vision veto to enhance safe deployment.

Findings

01

Attention to annotated evidence regions drops substantially during reasoning, often by more than half of evidence mass.

02

Full-response entropy is the most reliable text-only uncertainty signal under cross-dataset transfer.

03

A vision veto reduces risk by up to 1.9 percentage points at 90% coverage.
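
To make the "risk at 90% coverage" figure concrete, here is a minimal sketch of how selective risk is commonly computed: rank predictions by an uncertainty score, keep the most confident fraction, and measure the error rate among those kept. The function name `risk_at_coverage` and the toy data are illustrative assumptions, not the authors' evaluation code.

```python
def risk_at_coverage(scores, correct, coverage):
    """Error rate among the `coverage` fraction of examples with the
    lowest uncertainty score (hypothetical helper, not from the paper)."""
    if not 0 < coverage <= 1:
        raise ValueError("coverage must be in (0, 1]")
    if len(scores) != len(correct) or not scores:
        raise ValueError("scores and correct must be non-empty and aligned")
    # Sort example indices from most to least confident.
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    # Keep at least one example so the ratio is always defined.
    k = max(1, round(coverage * len(scores)))
    kept = order[:k]
    errors = sum(1 for i in kept if not correct[i])
    return errors / k


# Toy data: ten predictions, the two least confident ones are wrong.
toy_scores = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]
toy_correct = [True] * 8 + [False, False]
toy_risk_90 = risk_at_coverage(toy_scores, toy_correct, 0.9)
```

Under this framing, a veto that lowers risk by 1.9 percentage points at 90% coverage would mean the error rate among the retained 90% of predictions falls by 0.019 relative to the baseline ranking.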

Abstract

Reasoning VLMs can become more accurate while progressively losing visual grounding as they think. This creates task-conditional danger zones where low-entropy predictions are confident but ungrounded, a failure mode text-only monitoring cannot detect. Evaluating three reasoning VLMs on MathVista, HallusionBench, and MMMU_Pro, we find a pervasive evidence-collapse phenomenon: attention to annotated evidence regions drops substantially, often losing over half of evidence mass, as reasoning unfolds. Full-response entropy is the most reliable text-only uncertainty signal under cross-dataset transfer, yet adding vision features with a single global linear rule is brittle and often degrades transfer. An entropy-vision interaction model reveals a task-conditional regime: low-entropy, visually disengaged predictions are hazardous on sustained visual-reference tasks but benign on symbolic tasks.…
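
The two signals the abstract combines can be sketched roughly as follows. This is an illustration under stated assumptions, not the paper's implementation: "full-response entropy" is taken here to be the mean Shannon entropy over per-token distributions, "evidence mass" to be the share of image-attention weight on annotated evidence patches, and the thresholds in `in_danger_zone` are arbitrary placeholders.

```python
import math


def mean_token_entropy(token_dists):
    """Mean Shannon entropy (in nats) over per-token probability
    distributions; an assumed reading of 'full-response entropy'."""
    total = 0.0
    for dist in token_dists:
        total += -sum(p * math.log(p) for p in dist if p > 0)
    return total / len(token_dists)


def evidence_mass(attn, evidence_idx):
    """Fraction of image-attention weight that lands on the annotated
    evidence patches (indices into `attn`)."""
    total = sum(attn)
    if total == 0:
        return 0.0
    return sum(attn[i] for i in evidence_idx) / total


def in_danger_zone(entropy, mass, visual_task,
                   entropy_max=0.5, mass_min=0.2):
    """Flag confident-but-ungrounded predictions, and only on tasks that
    need sustained visual reference. Thresholds are hypothetical."""
    return visual_task and entropy < entropy_max and mass < mass_min


# Two tokens: one maximally uncertain over two options, one certain.
example_entropy = mean_token_entropy([[0.5, 0.5], [1.0, 0.0]])
# Four image patches; patches 2 and 3 are the annotated evidence.
example_mass = evidence_mass([0.1, 0.2, 0.3, 0.4], [2, 3])
```

Gating the flag on `visual_task` mirrors the abstract's point that the same low-entropy, low-attention pattern is hazardous on visual-reference tasks but benign on symbolic ones, which is why a single global rule transfers poorly.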
