Beyond Shortcuts: Mitigating Visual Illusions in Frozen VLMs via Qualitative Reasoning

Hao Guo; Fei Wang; Junjie Chen; Yiqi Nie; Jiaqi Zhao; Qiankun Li; and Subin Huang

arXiv:2604.26250·cs.CV·April 30, 2026

TL;DR

This paper introduces Structured Qualitative Inference (SQI), a training-free framework that enhances the perceptual robustness of frozen Vision-Language Models against optical illusions by integrating systematic qualitative reasoning modules.

Contribution

The paper presents a data-centric approach built on three modules (axiomatic constraints, hierarchical scene decomposition, and counterfactual verification) that improve visual grounding without model fine-tuning.

Findings

01  SQI ranked 2nd in the DataCV 2026 Challenge for illusion understanding.

02  Experimental results show significant accuracy improvements across illusion categories.

03  SQI offers interpretability benefits without requiring model fine-tuning.

Abstract

While Vision-Language Models (VLMs) have achieved state-of-the-art performance in general visual tasks, their perceptual robustness remains remarkably brittle when confronted with optical illusions. These failures are often attributed to shortcut heuristics, where models prioritize linguistic priors and memorized prototypes over direct visual evidence. In this work, we propose Structured Qualitative Inference (SQI), a training-free, data-centric framework designed to fortify visual grounding in frozen VLMs. SQI addresses perceptual anomalies through three systematic modules: (1) Axiomatic Constraint Injection, which suppresses erroneous metric estimations and quantitative hallucinations; (2) Hierarchical Scene Decomposition, which decouples target visual manifolds from complex background distractors; and (3) Counterfactual Self-Verification, an adversarial reasoning step that mitigates…
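The three modules named in the abstract can be pictured as a prompt-only pipeline around a frozen model. The sketch below is an illustration under stated assumptions, not the authors' implementation: the axiom wording, the prompt templates, the function name `sqi_style_answer`, and the `vlm(prompt, image)` callable are all hypothetical, and the abstract does not specify how the modules are ordered or combined.

```python
from typing import Any, Callable

# Hypothetical axioms; the paper's actual constraint text is not given in the abstract.
AXIOMS = (
    "Do not estimate lengths, sizes, or counts from memory of similar images.",
    "Compare elements qualitatively (longer, shorter, equal) using only what is visible.",
)


def _with_axioms(body: str) -> str:
    """Prepend the constraint text to a prompt (stand-in for constraint injection)."""
    return "\n".join(AXIOMS) + "\n\n" + body


def sqi_style_answer(question: str, image: Any, vlm: Callable[[str, Any], str]) -> str:
    """Query a frozen VLM three times: decompose, answer, then challenge the answer.

    `vlm` is an assumed interface that takes a text prompt and an image and
    returns text. No model weights are updated anywhere in this function.
    """
    # Step 1: separate the target objects from background distractors.
    scene = vlm(
        _with_axioms(
            "List the target objects and the background elements separately.\n"
            f"Question: {question}"
        ),
        image,
    )
    # Step 2: answer the question using only the decomposed targets.
    draft = vlm(
        _with_axioms(
            f"Scene description: {scene}\n"
            "Answer using only the target objects.\n"
            f"Question: {question}"
        ),
        image,
    )
    # Step 3: adversarial check of the draft before committing to a final answer.
    final = vlm(
        _with_axioms(
            f"Draft answer: {draft}\n"
            "Assume the draft is wrong. Look for visual evidence against it, "
            "then give a final answer.\n"
            f"Question: {question}"
        ),
        image,
    )
    return final
```

Because the model is passed in as a plain callable, the same function works with any frozen VLM wrapper and can be exercised with a stub that only records the prompts it receives.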
