TL;DR
This paper investigates whether vision-language models genuinely re-examine images during reasoning or merely imitate that behavior in text. It finds that models often fail to detect an image swap: they say they are looking again rather than actually doing so.
Contribution
The study introduces VisualSwap, an image-swap probing framework, and VS-Bench, a benchmark of 800 image pairs, to evaluate visual re-examination. It shows that models have limited ability to detect image swaps and that multi-turn user instructions restore visual grounding.
Findings
Model accuracy drops by up to 60% when images are swapped.
Thinking models are nearly 3x more vulnerable than their instructed counterparts.
User instructions improve visual grounding; self-reflection does not.
Abstract
Vision-Language Models (VLMs) often produce self-reflective statements like "let me check the figure again" during reasoning. Do such statements trigger genuine visual re-examination, or are they merely learned textual patterns? We investigate this via VisualSwap, an image-swap probing framework: after a model reasons over an image, we replace it with a visually similar but semantically different one and test whether the model notices. We introduce VS-Bench, 800 image pairs curated from MathVista, MathVerse, MathVision, and MMMU-Pro. Experiments on Qwen3-VL, Kimi-VL, and ERNIE-VL reveal a striking failure: models overwhelmingly miss the swap, with accuracy dropping by up to 60%. Counterintuitively, thinking models are nearly 3x more vulnerable than their instructed counterparts, and scaling offers no mitigation. Multi-turn user instructions restore visual grounding, but self-generated…
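The swap probe described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the `Model` interface, the `SwapPair` fields, the conversation layout, and the two toy models are all hypothetical, and the accuracy drop is assumed here to be an absolute difference between pre-swap and post-swap accuracy.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

# Hypothetical interface (an assumption, not the paper's API):
# a turn is (image_id, text); a model maps a conversation to an answer string.
Turn = Tuple[str, str]
Model = Callable[[List[Turn]], str]


@dataclass
class SwapPair:
    original_image: str   # image shown during the first reasoning pass
    swapped_image: str    # visually similar, semantically different image
    question: str
    original_answer: str  # correct answer for the original image
    swapped_answer: str   # correct answer for the swapped image


def swap_probe(model: Model, pair: SwapPair) -> Tuple[bool, bool]:
    """Return (correct_before_swap, correct_after_swap) for one pair."""
    first = [(pair.original_image, pair.question)]
    before = model(first)
    # Keep the earlier answer in context, but now present the swapped image.
    second = first + [("", before), (pair.swapped_image, pair.question)]
    after = model(second)
    return before == pair.original_answer, after == pair.swapped_answer


def accuracy_drop(model: Model, pairs: List[SwapPair]) -> float:
    """Absolute drop in accuracy from pre-swap to post-swap (assumed metric)."""
    results = [swap_probe(model, p) for p in pairs]
    acc_before = sum(b for b, _ in results) / len(results)
    acc_after = sum(a for _, a in results) / len(results)
    return acc_before - acc_after


# Toy stand-ins for illustration only.
ANSWERS = {"img_a": "3", "img_b": "5"}


def saying_model(turns: List[Turn]) -> str:
    # Answers from the first image it saw and never looks again.
    return ANSWERS[turns[0][0]]


def seeing_model(turns: List[Turn]) -> str:
    # Answers from the most recent image in the conversation.
    return ANSWERS[turns[-1][0]]


pairs = [SwapPair("img_a", "img_b", "How many bars are shown?", "3", "5")]
drop_saying = accuracy_drop(saying_model, pairs)
drop_seeing = accuracy_drop(seeing_model, pairs)
```

In this toy setting, the model that keeps answering from the first image loses all of its accuracy after the swap, while the model that reads the latest image loses none. A real evaluation would replace the toy callables with actual VLM calls and would need a more tolerant answer comparison than exact string equality.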
