Seeing the Evidence, Missing the Answer: Tool-Guided Vision-Language Models on Visual Illusions
Xuesong Wang, Harry Wang

TL;DR
This paper introduces a tool-guided inference framework that helps vision-language models (VLMs) interpret optical illusions and generalize to unfamiliar illusion variants without any additional training.
Contribution
It proposes a generic-tool-plus-routing approach: an off-the-shelf VLM is given generic image manipulation tools and an illusion-type routing prompt, and uses them in its reasoning chain to handle diverse illusions without model training.
Findings
Performance remained consistent on unfamiliar illusion variants.
Identified a positive-detection bias linked to training data imbalance.
Observed a dissociation between spatial reasoning and logical inference.
Abstract
Vision-language models (VLMs) exhibit a systematic bias when confronted with classic optical illusions: they overwhelmingly predict the illusion as "real" regardless of whether the image has been counterfactually modified. We present a tool-guided inference framework for the DataCV 2026 Challenge (Tasks I and II) that addresses this failure mode without any model training. An off-the-shelf vision-language model is given access to a small set of generic image manipulation tools (line drawing, region cropping, side-by-side comparison, and channel isolation), together with an illusion-type routing system prompt that prescribes which tools to invoke for each perceptual question category. Critically, every tool call produces a new, immutable image resource that is appended to a persistent registry, so the model can reference and compose any prior annotated view throughout its reasoning chain. Rather…
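The registry idea in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the names `ImageResource`, `ImageRegistry`, `call_tool`, and `ROUTES`, the `img_N` identifier scheme, and the question categories in the routing table are all hypothetical. Images are modeled as nested tuples of RGB pixels to keep the example dependency-free, and the line-drawing tool is omitted for brevity.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

# Assumption: an image is an immutable tuple of rows of (R, G, B) pixels.
Pixel = Tuple[int, int, int]
Image = Tuple[Tuple[Pixel, ...], ...]


@dataclass(frozen=True)
class ImageResource:
    """One immutable entry in the registry (hypothetical structure)."""
    resource_id: str
    image: Image
    produced_by: str
    parents: Tuple[str, ...]


class ImageRegistry:
    """Append-only store: tool outputs are added, never overwritten."""

    def __init__(self, original: Image) -> None:
        self._items: List[ImageResource] = []
        self.add(original, "input", ())

    def add(self, image: Image, produced_by: str, parents) -> str:
        rid = f"img_{len(self._items)}"
        self._items.append(ImageResource(rid, image, produced_by, tuple(parents)))
        return rid

    def get(self, rid: str) -> ImageResource:
        return self._items[int(rid.split("_")[1])]

    def ids(self) -> List[str]:
        return [r.resource_id for r in self._items]


def crop(img: Image, top: int, left: int, height: int, width: int) -> Image:
    return tuple(row[left:left + width] for row in img[top:top + height])


def isolate_channel(img: Image, channel: int) -> Image:
    # Keep one RGB channel and zero the other two.
    return tuple(
        tuple(tuple(v if i == channel else 0 for i, v in enumerate(px)) for px in row)
        for row in img
    )


def side_by_side(a: Image, b: Image) -> Image:
    if len(a) != len(b):
        raise ValueError("images must have the same height")
    return tuple(row_a + row_b for row_a, row_b in zip(a, b))


TOOLS = {"crop": crop, "isolate_channel": isolate_channel, "side_by_side": side_by_side}

# Hypothetical routing table: question category -> suggested tool order.
ROUTES: Dict[str, List[str]] = {
    "color": ["crop", "side_by_side", "isolate_channel"],
    "size": ["crop", "side_by_side"],
}


def call_tool(registry: ImageRegistry, name: str, parent_ids, **kwargs) -> str:
    """Run a tool on existing resources and register the result as a new one."""
    inputs = [registry.get(pid).image for pid in parent_ids]
    result = TOOLS[name](*inputs, **kwargs)
    return registry.add(result, name, parent_ids)


# Usage: compose earlier views without ever modifying them.
original: Image = tuple(tuple((10 * r, 20 * c, 30) for c in range(4)) for r in range(2))
registry = ImageRegistry(original)
left = call_tool(registry, "crop", ["img_0"], top=0, left=0, height=2, width=2)
right = call_tool(registry, "crop", ["img_0"], top=0, left=2, height=2, width=2)
pair = call_tool(registry, "side_by_side", [left, right])
red = call_tool(registry, "isolate_channel", [pair], channel=0)
```

Because each call returns a fresh identifier and stores the identifiers of its inputs, a later step can refer to any earlier view (for example, comparing two crops and then isolating a channel of the comparison) while the original image stays untouched.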
