TL;DR
This paper introduces VI-Probe, a framework for testing whether large vision-language models (VLMs) perceive visual changes or rely on memorized patterns. It finds that the underlying mechanisms differ across models.
Contribution
The study presents a systematic probing framework with controlled visual illusions to analyze perception versus recall in VLMs, moving beyond average accuracy measures.
Findings
GPT-5 shows memory override behavior.
Claude-Opus-4.1 exhibits perception-memory competition.
Qwen variants show signs of visual-processing limits.
Abstract
Large Vision-Language Models (VLMs) often answer classic visual illusions "correctly" on original images, yet persist with the same responses when illusion factors are inverted, even though the visual change is obvious to humans. This raises a fundamental question: do VLMs perceive visual changes or merely recall memorized patterns? While several studies have noted this phenomenon, the underlying causes remain unclear. To move from observations to systematic understanding, this paper introduces VI-Probe, a controllable visual-illusion framework with graded perturbations and matched visual controls (without illusion inducer) that disentangles visually grounded perception from language-driven recall. Unlike prior work that focuses on averaged accuracy, we measure stability and sensitivity using Polarity-Flip Consistency, Template Fixation Index, and an illusion multiplier normalized…
