Unbiased Visual Reasoning with Controlled Visual Inputs
Zhaonan Li, Shijie Lu, Fei Wang, Jacob Dineen, Xiao Ye, Zhikun Xu, Siyi Liu, Young Min Cho, Bangzheng Li, Daniel Chang, Kenny Nguyen, Qizheng Yang, Muhao Chen, Ben Zhou

TL;DR
VISTA is a modular framework that improves unbiased visual reasoning by separating perception from reasoning, using an explicit information bottleneck and reinforcement learning to reduce reliance on spurious correlations.
Contribution
It introduces VISTA, a novel approach that decouples perception and reasoning in vision-language models, enhancing robustness and interpretability in visual question answering.
Findings
Significantly improves robustness to spurious correlations on SpuriVerse.
Transfers robustly across unseen VLM sensors.
Produces more neutral and visually grounded reasoning traces.
Abstract
End-to-end Vision-language Models (VLMs) often answer visual questions by exploiting spurious correlations instead of causal visual evidence, and can become more shortcut-prone when fine-tuned. We introduce VISTA (Visual-Information Separation for Text-based Analysis), a modular framework that decouples perception from reasoning via an explicit information bottleneck. A frozen VLM sensor is restricted to short, objective perception queries, while a text-only LLM reasoner decomposes each question, plans queries, and aggregates visual facts in natural language. This controlled interface defines a reward-aligned environment for training unbiased visual reasoning with reinforcement learning. Instantiated with Qwen2.5-VL and Llama3.2-Vision sensors, and trained with GRPO from only 641 curated multi-step questions, VISTA significantly improves robustness to real-world spurious correlations on…
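The sensor/reasoner interface described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the names `Sensor`, `Reasoner`, `run_episode`, and `is_perception_query`, the keyword-based rejection rule, and the toy stubs are all assumptions made for illustration. A real system would presumably call a frozen VLM and a text-only LLM where the stubs appear.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple

Fact = Tuple[str, str]  # (query, sensor reply)

# Hypothetical markers of high-level inference; the paper's actual
# accept/reject policy is not reproduced here.
INFERENCE_MARKERS = ("why", "should", "probably", "likely", "intend")


def is_perception_query(query: str) -> bool:
    """Accept short, objective queries; reject ones that ask for inference."""
    words = query.lower().rstrip("?").split()
    asks_inference = any(w.startswith(m) for w in words for m in INFERENCE_MARKERS)
    return len(words) <= 12 and not asks_inference


@dataclass
class Sensor:
    """Stateless wrapper around a frozen VLM (stubbed here as a callable)."""
    vlm: Callable[[str, str], str]  # (image_id, query) -> short answer

    def ask(self, image_id: str, query: str) -> str:
        if not is_perception_query(query):
            return "REJECTED: perception queries only"
        return self.vlm(image_id, query)


@dataclass
class Reasoner:
    """Text-only reasoner: it never sees pixels, only sensor replies."""
    plan: Callable[[str, List[Fact]], Optional[str]]  # next query, or None to stop
    answer: Callable[[str, List[Fact]], str]          # final answer from the facts


def run_episode(question: str, image_id: str, sensor: Sensor,
                reasoner: Reasoner, max_queries: int = 5):
    """Alternate planning and sensing; all visual information flows through text."""
    facts: List[Fact] = []
    for _ in range(max_queries):
        query = reasoner.plan(question, facts)
        if query is None:
            break
        facts.append((query, sensor.ask(image_id, query)))
    return reasoner.answer(question, facts), facts


# Toy stubs standing in for the real models (assumptions for this sketch).
def toy_vlm(image_id: str, query: str) -> str:
    return {"What color is the mug?": "red"}.get(query, "unknown")


def toy_plan(question: str, facts: List[Fact]) -> Optional[str]:
    queue = ["What color is the mug?", "Why is the person probably upset?"]
    return queue[len(facts)] if len(facts) < len(queue) else None


def toy_answer(question: str, facts: List[Fact]) -> str:
    grounded = [reply for _, reply in facts if not reply.startswith("REJECTED")]
    return grounded[0] if grounded else "unknown"
```

Calling `run_episode("What color is the mug?", "img0", Sensor(toy_vlm), Reasoner(toy_plan, toy_answer))` sends the first query through to the stub VLM and rejects the second, so the reasoner answers only from the grounded fact. The returned list of facts doubles as an inspectable trace, which is one way the bottleneck could support the interpretability the paper claims.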
Peer Reviews
Decision · Submitted to ICLR 2026
1) The paper crisply identifies spurious-cue reliance and the conflation of perception and reasoning in end-to-end VLMs, motivating a modular remedy. 2) VISTA enforces an explicit information bottleneck between a text-only reasoner and a stateless VLM sensor, cleanly separating decision-making from raw pixels. 3) The sensor accepts only six classes of perception queries and rejects high-level inference, with a concrete policy and examples.
1) The proposed information bottleneck between the sensor and reasoner is conceptually interesting, but it may also introduce new risks. By restricting the reasoner’s access to full and detailed visual information, the model could miss critical cues needed for complex reasoning. Moreover, if the stateless visual sensor makes errors or misinterprets the scene, the reasoner has no means to recover or verify the missing context, potentially amplifying mistakes. The paper should further analyze and
Overall, I like the high-level motivation, which limits the VLMs to doing what they can do well. For this direction, I would expect more analysis of how to determine what VLMs can do well, rather than a fairly unclear policy that accepts or rejects queries in a straightforward way. That said, targeting spurious visual correlations is highly relevant to recent progress in VLMs. Empirical results across multiple benchmarks demonstrate certain robustness and cross-model generalization with minimal data and t
- The biggest weakness to me is the experimental setting. MMVP is a small-scale dataset with only 150 image pairs, and the authors randomly sample a 500-example subset from SeedBench. With these choices, the experiments can hardly deliver reliable conclusions. Besides, as the authors mention, the evaluated datasets are "everyday-scene benchmark". However, since this paper is motivated by the claim that "existing VLMs rely on spurious visual cues, conflating perception", there are datasets suitable for this purpose, such as V
1. The authors propose the VISTA framework to address shortcut learning in VLMs, which explicitly separates visual perception (sensor) from logical reasoning (reasoner) to mitigate reliance on spurious visual cues.
1. While the VISTA framework attempts to address shortcut learning by employing a dual-agent architecture (VLM + LLM), this approach does not fundamentally solve the underlying issue within the VLM itself. The VLM component remains susceptible to shortcut learning, merely transferring rather than resolving this critical limitation. 2. The evaluation is currently limited to established benchmarks. To better demonstrate the method's robustness and generalizability, performance should be validated
Taxonomy
Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Topic Modeling
