IllusionVQA: A Challenging Optical Illusion Dataset for Vision Language Models

Haz Sameen Shahgir; Khondker Salman Sayeed; Abhik Bhattacharjee; Wasi Uddin Ahmad; Yue Dong; Rifat Shahriyar

arXiv:2403.15952·cs.CV·August 12, 2024·6 cites

TL;DR

IllusionVQA is a dataset of optical illusions and hard-to-interpret scenes for evaluating how vision-language models understand and reason about inherently unreasonable images. The evaluation shows that current models fall well short of human performance.

Contribution

This paper presents IllusionVQA, a novel dataset of optical illusions designed to test VLMs' comprehension and localization abilities, highlighting their current weaknesses.

Findings

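01  GPT4V, the best-performing VLM, achieves 62.99% accuracy (4-shot) on the comprehension task, versus 91.03% for humans

02  On the localization task, GPT4V reaches 49.7% (4-shot and Chain-of-Thought), while humans score 100%

03  In-Context Learning and Chain-of-Thought reasoning degrade VLM performance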

Abstract

The advent of Vision Language Models (VLM) has allowed researchers to investigate the visual understanding of a neural network using natural language. Beyond object classification and detection, VLMs are capable of visual comprehension and common-sense reasoning. This naturally led to the question: How do VLMs respond when the image itself is inherently unreasonable? To this end, we present IllusionVQA: a diverse dataset of challenging optical illusions and hard-to-interpret scenes to test the capability of VLMs in two distinct multiple-choice VQA tasks - comprehension and soft localization. GPT4V, the best performing VLM, achieves 62.99% accuracy (4-shot) on the comprehension task and 49.7% on the localization task (4-shot and Chain-of-Thought). Human evaluation reveals that humans achieve 91.03% and 100% accuracy in comprehension and localization. We discover that In-Context Learning…
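
The accuracies quoted in the abstract are multiple-choice scores. A minimal sketch of how such a score could be computed, assuming each question has exactly one correct option label and each model reply has already been reduced to a single label. The function name and the toy data are hypothetical and are not taken from the paper or its repository.

```python
from typing import Sequence


def multiple_choice_accuracy(predictions: Sequence[str], answers: Sequence[str]) -> float:
    """Return the percentage of predicted option labels that match the answer key.

    Hypothetical helper for illustration only; labels are compared
    case-insensitively after stripping surrounding whitespace.
    """
    if len(predictions) != len(answers):
        raise ValueError("predictions and answers must have the same length")
    if not answers:
        raise ValueError("cannot score an empty question set")
    correct = sum(
        p.strip().upper() == a.strip().upper()
        for p, a in zip(predictions, answers)
    )
    return 100.0 * correct / len(answers)


# Toy data (made up): four questions, three answered correctly.
toy_predictions = ["A", "c", "B", "D"]
toy_answers = ["A", "C", "D", "D"]
print(f"{multiple_choice_accuracy(toy_predictions, toy_answers):.2f}%")
```

Reported figures such as 62.99% would be this ratio taken over the full question set of a task, under whatever answer-extraction rule the authors applied to the model output.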

Code & Models

Repositories

csebuetnlp/illusionvqa (PyTorch, official)

Taxonomy

Topics: Multimodal Machine Learning Applications · Robotics and Automated Systems · Advanced Image and Video Retrieval Techniques