IllusionBench+: A Large-scale and Comprehensive Benchmark for Visual Illusion Understanding in Vision-Language Models

Yiming Zhang; Zicheng Zhang; Xinyi Wei; Xiaohong Liu; Guangtao Zhai; Xiongkuo Min

arXiv:2501.00848·cs.CV·June 23, 2025

Open Access

TL;DR

IllusionBench+ is a large-scale benchmark dataset for evaluating how well vision-language models understand visual illusions, covering both classical and real-world illusions. Evaluation on it reveals limited perceptual abilities and hallucination issues in current models.

Contribution

This work introduces IllusionBench+, the largest dataset of its kind, to systematically assess and analyze the perceptual abilities of state-of-the-art vision-language models on visual illusions.

Findings

01. Top model GPT-4o achieves 80.59% accuracy on true-or-false tasks.

02. Models show significant hallucination issues, especially on trap illusions.

03. Current models still lag behind human performance in illusion understanding.

Abstract

Current Visual Language Models (VLMs) show impressive image understanding but struggle with visual illusions, especially in real-world scenarios. Existing benchmarks focus on classical cognitive illusions, which have been learned by state-of-the-art (SOTA) VLMs, revealing issues such as hallucinations and limited perceptual abilities. To address this gap, we introduce IllusionBench, a comprehensive visual illusion dataset that encompasses not only classic cognitive illusions but also real-world scene illusions. This dataset features 1,051 images, 5,548 question-answer pairs, and 1,051 golden text descriptions that address the presence, causes, and content of the illusions. We evaluate ten SOTA VLMs on this dataset using true-or-false, multiple-choice, and open-ended tasks. In addition to real-world illusions, we design trap illusions that resemble classical patterns but differ in…

Taxonomy

Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Domain Adaptation and Few-Shot Learning