Do Large Vision-Language Models Distinguish between the Actual and Apparent Features of Illusions?
Taiga Shinozaki, Tomoki Doi, Amane Watahiki, Satoshi Nishida, Hitomi Yanaka

TL;DR
This study investigates whether large vision-language models can distinguish between actual and apparent features in optical illusions, revealing that models may rely on prior knowledge rather than true visual comprehension.
Contribution
Introduces a novel VQA dataset with genuine and fake illusions to evaluate LVLMs' ability to differentiate actual and apparent features.
Findings
LVLMs often give similar answers to genuine and fake illusions.
Models may rely on prior knowledge rather than visual understanding.
The dataset enables more precise assessment of machine perception of illusions.
Abstract
Humans are susceptible to optical illusions, which serve as valuable tools for investigating sensory and cognitive processes. Inspired by human vision studies, research has begun exploring whether machines, such as large vision-language models (LVLMs), exhibit similar susceptibilities to visual illusions. However, existing studies have often used non-abstract images and have not distinguished between actual and apparent features, leading to ambiguous assessments of machine cognition. To address these limitations, we introduce a visual question answering (VQA) dataset, categorized into genuine and fake illusions, along with corresponding control images. Genuine illusions present discrepancies between actual and apparent features, whereas fake illusions have the same actual and apparent features even though they look illusory because of their similar geometric configuration. We evaluate the performance of LVLMs…
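The genuine/fake distinction defined in the abstract, and the finding that models give similar answers to both, can be made concrete with a small sketch. Everything named here is hypothetical and chosen for illustration only: the `IllusionItem` fields, the `is_consistent` check, and the `agreement_rate` helper are not the authors' dataset schema or evaluation metric, and the one-to-one pairing of genuine and fake items is an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IllusionItem:
    """Hypothetical record for one image; not the paper's actual schema."""
    kind: str            # "genuine", "fake", or "control"
    actual_same: bool    # ground truth: are the compared features really equal?
    apparent_same: bool  # do the features look equal to a human observer?


def is_consistent(item: IllusionItem) -> bool:
    """Check an item against the definitions given in the abstract."""
    if item.kind == "genuine":
        # Genuine illusions: actual and apparent features disagree.
        return item.actual_same != item.apparent_same
    if item.kind == "fake":
        # Fake illusions: actual and apparent features agree.
        return item.actual_same == item.apparent_same
    # Control images are left unconstrained in this sketch.
    return True


def agreement_rate(answers_genuine: list[str], answers_fake: list[str]) -> float:
    """Fraction of paired questions that received the same answer.

    Assumes answers_genuine[i] and answers_fake[i] belong to a genuine
    illusion and its similarly configured fake counterpart.
    """
    if len(answers_genuine) != len(answers_fake):
        raise ValueError("answer lists must be paired one-to-one")
    if not answers_genuine:
        return 0.0
    same = sum(g == f for g, f in zip(answers_genuine, answers_fake))
    return same / len(answers_genuine)
```

Under these assumptions, a high agreement rate on pairs whose correct answers differ would be consistent with a model answering from prior knowledge of the illusion's layout rather than from the image itself.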
Taxonomy
Topics: Multimodal Machine Learning Applications · Action Observation and Synchronization · Face Recognition and Perception
