Are VLMs Seeing or Just Saying? Uncovering the Illusion of Visual Re-examination

Chufan Shi; Cheng Yang; Yaokang Wu; Linhao Jin; Bo Shui; Taylor Berg-Kirkpatrick; Xuezhe Ma

arXiv:2605.15864 · cs.CV · May 18, 2026


TL;DR

This paper investigates whether vision-language models (VLMs) genuinely re-examine an image when they say they are doing so, or merely reproduce the phrasing. When the image is swapped after the model has reasoned over it, models usually fail to notice: they say rather than see.

Contribution

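The study introduces VisualSwap, an image-swap probing framework, and VS-Bench, a benchmark of 800 image pairs curated from MathVista, MathVerse, MathVision, and MMMU-Pro. Together they show that models rarely detect a swapped image, and that user instructions restore visual grounding while self-reflection does not.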

Findings

01. Accuracy drops by up to 60% when the image is swapped.

02. Thinking models are nearly 3x more vulnerable than their instructed counterparts.

03. Multi-turn user instructions restore visual grounding; self-reflection does not.

Abstract

Vision-Language Models (VLMs) often produce self-reflective statements like "let me check the figure again" during reasoning. Do such statements trigger genuine visual re-examination, or are they merely learned textual patterns? We investigate this via VisualSwap, an image-swap probing framework: after a model reasons over an image, we replace it with a visually similar but semantically different one and test whether the model notices. We introduce VS-Bench, 800 image pairs curated from MathVista, MathVerse, MathVision, and MMMU-Pro. Experiments on Qwen3-VL, Kimi-VL, and ERNIE-VL reveal a striking failure: models overwhelmingly miss the swap, with accuracy dropping by up to 60%. Counterintuitively, thinking models are nearly 3x more vulnerable than their instructed counterparts, and scaling offers no mitigation. Multi-turn user instructions restore visual grounding, but self-generated…
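
The image-swap probe described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the `SwapPair` fields, the `probe` and `swap_report` functions, the two-stage reason/answer interface, and the toy models are all hypothetical, and images are stood in for by strings so the example runs without a real VLM. It assumes each pair carries a known correct answer for both images, so an answer that matches the original image's label after the swap counts as a missed swap.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# Hypothetical interfaces (assumptions, not the paper's API):
#   ReasonFn: (image, question) -> reasoning text produced before the swap
#   AnswerFn: (image, question, prior_reasoning) -> final answer after the swap
ReasonFn = Callable[[str, str], str]
AnswerFn = Callable[[str, str, str], str]


@dataclass
class SwapPair:
    question: str
    original_image: str   # image the model reasons over first
    swapped_image: str    # visually similar, semantically different image
    original_answer: str  # correct answer for original_image
    swapped_answer: str   # correct answer for swapped_image


def probe(pair: SwapPair, reason_fn: ReasonFn, answer_fn: AnswerFn) -> str:
    """Reason over the original image, swap it, then classify the final answer."""
    reasoning = reason_fn(pair.original_image, pair.question)
    final = answer_fn(pair.swapped_image, pair.question, reasoning)
    if final == pair.swapped_answer:
        return "noticed"  # answer follows the image now in context
    if final == pair.original_answer:
        return "missed"   # answer follows the earlier reasoning text
    return "other"


def swap_report(pairs: List[SwapPair], reason_fn: ReasonFn,
                answer_fn: AnswerFn) -> Dict[str, float]:
    """Fraction of pairs falling into each outcome."""
    if not pairs:
        raise ValueError("need at least one image pair")
    outcomes = [probe(p, reason_fn, answer_fn) for p in pairs]
    return {k: outcomes.count(k) / len(outcomes)
            for k in ("noticed", "missed", "other")}


# Toy stand-ins: an "image" is a string whose content follows the '=' sign.
def toy_reason(image: str, question: str) -> str:
    return f"Let me check the figure again. The value is {image.split('=')[1]}."


def saying_model(image: str, question: str, reasoning: str) -> str:
    # Ignores the current image and repeats the value from its own reasoning.
    return reasoning.rstrip(".").split()[-1]


def seeing_model(image: str, question: str, reasoning: str) -> str:
    # Re-reads the image that is actually in context.
    return image.split("=")[1]


pairs = [
    SwapPair("What is the peak?", "chart:peak=7", "chart:peak=9", "7", "9"),
    SwapPair("How many sides?", "shape:sides=5", "shape:sides=6", "5", "6"),
]

saying = swap_report(pairs, toy_reason, saying_model)
seeing = swap_report(pairs, toy_reason, seeing_model)
print("saying model:", saying)
print("seeing model:", seeing)
```

In this toy setup the "saying" model echoes its earlier reasoning and so misses every swap, while the "seeing" model reads the current image and notices every one. A real evaluation would replace the toy functions with calls to an actual VLM and would need a more forgiving answer matcher than exact string equality.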


Code & Models

Repositories

https://visualswap.github.io
