From Illusion to Intention: Visual Rationale Learning for Vision-Language Reasoning
Changpeng Wang, Haozhe Wang, Xi Chen, Junhan Liu, Taofeng Xue, Chong Peng, Donglian Qi, Fangzhen Lin, Yunfeng Yan

TL;DR
This paper introduces Visual Rationale Learning (ViRL), a training paradigm for vision-language reasoning that treats visual actions as core reasoning primitives rather than optional tools, grounding training in visual rationales to improve transparency and accuracy.
Contribution
The paper proposes ViRL, an end-to-end reinforcement learning framework that incorporates process supervision, objective alignment, and fine-grained credit assignment for better visual reasoning.
Findings
Achieves state-of-the-art results on multiple benchmarks.
Enhances model transparency and trustworthiness.
Effectively grounds reasoning in visual evidence.
Abstract
Recent advances in vision-language reasoning underscore the importance of thinking with images, where models actively ground their reasoning in visual evidence. Yet, prevailing frameworks treat visual actions as optional tools, boosting metrics but leaving reasoning ungrounded and crops ineffective. This gap gives rise to the illusion of thinking with images: models seem visually grounded but rely on context-agnostic actions that neither refine perception nor guide reasoning toward correct answers. We address this problem by reframing visual actions as core reasoning primitives rather than optional tools, which we term visual rationalization, the visual analogue of textual Chain-of-Thought. Building on this insight, we propose Visual Rationale Learning (ViRL), an end-to-end paradigm that grounds training in the visual rationale itself. ViRL integrates (1) Process Supervision with…
Taxonomy
Topics: Multimodal Machine Learning Applications · Language, Metaphor, and Cognition · Explainable Artificial Intelligence (XAI)
