MIRAGE: The Illusion of Visual Understanding
Mohammad Asadi, Jack W. O'Sullivan, Fang Cao, Tahoura Nedaee, Kamyar Rajabalifardi, Fei-Fei Li, Ehsan Adeli, Euan Ashley

TL;DR
This paper shows that multimodal AI models can generate detailed reasoning and score highly on benchmarks without any visual input, exposing weaknesses in current evaluation methods, and proposes a new benchmark for fairer assessment.
Contribution
The paper uncovers mirage reasoning in multimodal models, demonstrates their high performance without images, and introduces B-Clean for unbiased evaluation of visual-language understanding.
Findings
Models generate detailed image descriptions without images.
Models achieve high benchmark scores without visual input.
Explicit instructions to guess reduce model performance.
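The second finding suggests a simple sanity check for any multimodal benchmark: score a model once with images and once with the images withheld, then compare both numbers to chance. The sketch below is only an illustration of that check, not the authors' evaluation code or the B-Clean procedure. The names `Item`, `accuracy`, `chance_level`, `longest_choice_model`, and `ITEMS` are hypothetical, and the toy model and data are made up for demonstration.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence


@dataclass
class Item:
    """One multiple-choice benchmark question (hypothetical schema)."""
    question: str
    choices: Sequence[str]
    answer: int                      # index of the correct choice
    image: Optional[bytes] = None    # raw image bytes, if any


# A model maps (question, choices, image-or-None) to a chosen index.
Model = Callable[[str, Sequence[str], Optional[bytes]], int]


def accuracy(model: Model, items: Sequence[Item], use_image: bool) -> float:
    """Fraction of items answered correctly, optionally withholding images."""
    correct = 0
    for item in items:
        image = item.image if use_image else None
        correct += int(model(item.question, item.choices, image) == item.answer)
    return correct / len(items)


def chance_level(items: Sequence[Item]) -> float:
    """Expected accuracy of uniform random guessing over the choices."""
    return sum(1 / len(item.choices) for item in items) / len(items)


def longest_choice_model(question, choices, image):
    """Toy stand-in for a model: ignores the image, picks the longest option."""
    return max(range(len(choices)), key=lambda i: len(choices[i]))


# Made-up items, used only to exercise the functions above.
ITEMS = [
    Item("Q1", ["no", "pleural effusion"], answer=1),
    Item("Q2", ["cardiomegaly", "no"], answer=0),
    Item("Q3", ["yes", "normal"], answer=0),
    Item("Q4", ["a", "bb", "ccc", "dddd"], answer=3),
]

if __name__ == "__main__":
    print("no-image accuracy:", accuracy(longest_choice_model, ITEMS, use_image=False))
    print("chance level:", chance_level(ITEMS))
```

If the no-image accuracy sits well above chance, some questions can be answered from text cues alone, so a high score on them says little about visual understanding.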
Abstract
Multimodal AI systems have achieved remarkable performance across a broad range of real-world tasks, yet the mechanisms underlying visual-language reasoning remain surprisingly poorly understood. We report three findings that challenge prevailing assumptions about how these systems process and integrate visual information. First, frontier models readily generate detailed image descriptions and elaborate reasoning traces, including pathology-biased clinical findings, for images never provided; we term this phenomenon mirage reasoning. Second, without any image input, models also attain strikingly high scores across general and medical multimodal benchmarks, bringing into question their utility and design. In the most extreme case, our model achieved the top rank on a standard chest X-ray question-answering benchmark without access to any images. Third, when models were explicitly…
