Seeing but Not Believing: Probing the Disconnect Between Visual Attention and Answer Correctness in VLMs

Zhining Liu; Ziyi Chen; Hui Liu; Chen Luo; Xianfeng Tang; Suhang Wang; Joy Zeng; Zhenwei Dai; Zhan Shi; Tianxin Wei; Benoit Dumoulin; Hanghang Tong

arXiv:2510.17771·cs.AI·October 21, 2025

Open Access 3 Reviews

TL;DR

This paper investigates why vision-language models sometimes fail despite perceiving the correct visual evidence, revealing that they often see but do not believe the evidence, and proposes an attention-based intervention to improve accuracy.

Contribution

It uncovers the disconnect between evidence perception and utilization in VLMs and introduces a training-free method to enhance their reasoning by highlighting evidence regions.

Findings

01

Deeper layers reliably attend to evidence regions

02

VLMs often perceive evidence even when answers are incorrect

03

Selective attention masking improves model accuracy
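The first two findings rest on measuring where attention mass goes at each layer. A minimal sketch of such a measurement follows, assuming attention weights from the answer token have already been extracted into a NumPy array. The function names, array layout, and the binary evidence mask are illustrative assumptions, not the paper's own metric or code.

```python
import numpy as np

def image_attention_share(attn, image_token_mask):
    """Fraction of attention mass on image tokens, per layer.

    attn: array of shape (layers, heads, seq), attention from the answer
          token to every input token (assumed layout).
    image_token_mask: boolean array of shape (seq,), True for image patches.
    """
    attn = np.asarray(attn, dtype=float)
    mask = np.asarray(image_token_mask, dtype=bool)
    per_layer = attn.mean(axis=1)                 # average over heads
    return per_layer[:, mask].sum(axis=1) / per_layer.sum(axis=1)

def evidence_attention_share(attn, image_token_mask, evidence_patch_mask):
    """Fraction of image-token attention that lands on evidence patches.

    evidence_patch_mask: boolean array over image tokens only, True where a
    patch overlaps the annotated evidence region (hypothetical annotation).
    """
    attn = np.asarray(attn, dtype=float)
    mask = np.asarray(image_token_mask, dtype=bool)
    evidence = np.asarray(evidence_patch_mask, dtype=bool)
    image_attn = attn.mean(axis=1)[:, mask]       # (layers, n_image_tokens)
    return image_attn[:, evidence].sum(axis=1) / image_attn.sum(axis=1)

# Toy input: 2 layers, 1 head, 2 text tokens followed by 2 image tokens.
toy_attn = np.array([[[0.4, 0.4, 0.1, 0.1]],
                     [[0.1, 0.1, 0.7, 0.1]]])
toy_image_mask = np.array([False, False, True, True])
toy_evidence = np.array([True, False])
image_share = image_attention_share(toy_attn, toy_image_mask)
evidence_share = evidence_attention_share(toy_attn, toy_image_mask, toy_evidence)
```

In the toy input the second layer puts most of its mass on the image and concentrates it on the evidence patch, which is the pattern the findings describe for deeper layers. A high evidence share on an incorrectly answered question would be one instance of the case described in the second finding.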

Abstract

Vision-Language Models (VLMs) achieve strong results on multimodal tasks such as visual question answering, yet they can still fail even when the correct visual evidence is present. In this work, we systematically investigate whether these failures arise from not perceiving the evidence or from not leveraging it effectively. By examining layer-wise attention dynamics, we find that shallow layers focus primarily on text, while deeper layers sparsely but reliably attend to localized evidence regions. Surprisingly, VLMs often perceive the visual evidence even when outputting incorrect answers, a phenomenon we term "seeing but not believing" that is widespread across major VLM families. Building on this, we introduce an inference-time intervention that highlights deep-layer evidence regions through selective attention-based masking. It requires no training and consistently improves accuracy across…
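To make the idea of highlighting evidence regions concrete, here is a rough sketch of one way an attention-derived mask could be applied to an input image: patches whose deep-layer attention score is above a quantile threshold are kept, and the rest are dimmed. This is not the authors' algorithm. The function name `highlight_evidence`, the quantile threshold, and the dimming factor are all assumptions made for illustration.

```python
import numpy as np

def highlight_evidence(image, patch_attention, keep_quantile=0.8, dim_factor=0.4):
    """Dim image regions whose patch attention falls below a quantile.

    image: uint8 array of shape (H, W, 3).
    patch_attention: non-negative array of shape (gh, gw), one score per
        image patch, e.g. averaged over deep layers (assumed input).
    keep_quantile, dim_factor: hypothetical knobs, not values from the paper.
    """
    image = np.asarray(image)
    scores = np.asarray(patch_attention, dtype=float)
    h, w = image.shape[:2]
    gh, gw = scores.shape
    if h % gh or w % gw:
        raise ValueError("image size must be a multiple of the patch grid")

    # Patches at or above the threshold are treated as evidence.
    keep = scores >= np.quantile(scores, keep_quantile)

    # Nearest-neighbour upsampling of the patch mask to pixel resolution.
    keep_px = np.repeat(np.repeat(keep, h // gh, axis=0), w // gw, axis=1)

    # Evidence pixels keep full brightness; everything else is dimmed.
    weights = np.where(keep_px, 1.0, dim_factor)[..., None]
    return np.clip(image * weights, 0, 255).astype(np.uint8)

# Toy input: a uniform 4x4 image with a 2x2 patch grid, one salient patch.
toy_image = np.full((4, 4, 3), 200, dtype=np.uint8)
toy_scores = np.array([[0.9, 0.1],
                       [0.1, 0.1]])
highlighted = highlight_evidence(toy_image, toy_scores,
                                 keep_quantile=0.75, dim_factor=0.5)
```

The modified image would then be passed back to the model for a second forward pass. Because the model's weights are untouched, an approach of this kind needs no training, at the cost of an extra pass per query.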

Peer Reviews

Decision·ICLR 2026 Poster

Reviewer 01 · Rating 6 · Confidence 4

Strengths

1. The paper presents an insightful observation on VLMs’ behaviors toward images. Its visualizations and analyses further reveal how text and image interactions are modeled across different layers, showing that the encoding of semantic features first emerges in deeper layers. The inconsistency between attention maps (i.e., "seeing but not believing") highlights limitations of current VLM architectures, which is valuable for guiding future research.

2. To address this issue, the authors propose

Weaknesses

1. The method improves VLM performance by overlaying a salient mask on the input image, but it does not "fix the VLM’s attention behavior" (the behavior of the VLM and its attention is not changed). Additionally, the design of the algorithm introduces extra cost and raises concerns about multi-turn and multi-image scenarios.

2. The proposed algorithm augments the brightness of different regions of the image. However, this augmentation will change the original image, causing information l

Reviewer 02 · Rating 6 · Confidence 3

Strengths

1. The authors conduct a thorough investigation of how different layers in VLMs process inputs and distribute attention, providing valuable insights for the community.

2. The proposed solution is simple yet effective, showing consistent improvements across eight different VLMs.

Weaknesses

1. It is important to evaluate cases where the model does not attend to the correct regions. How often does the model still answer correctly in such cases? Does performance degrade when highlighting regions based on incorrect attention?

2. The motivation is somewhat similar to [1], which also identifies this phenomenon and proposes attention-based approaches. This overlap reduces the novelty of the contribution, though the improvements of the method are still appreciated.

[1] "Unveiling the Ig

Reviewer 03 · Rating 6 · Confidence 3

Strengths

1. The paper is well written and easy to follow. The phenomenon of “seeing but not believing” is intriguing.

2. The experiments and ablation studies are comprehensive.

3. The proposed VEA approach provides both interpretability and performance improvement.

Weaknesses

1. Some experimental details are missing. For example, in Figures 2 and 3, it is unclear which models were used for attention map visualization.

2. Including more visual examples could strengthen and better support the overall narrative.

Taxonomy

Topics: Multimodal Machine Learning Applications · Topic Modeling · Neurobiology of Language and Bilingualism