Bad Seeing or Bad Thinking? Rewarding Perception for Vision-Language Reasoning
Haozhe Wang, Qixin Xu, Changpeng Wang, Taofeng Xue, Chong Peng, Wenhu Chen, Fangzhen Lin

TL;DR
This paper introduces a reinforcement learning framework that enhances vision-language models by explicitly rewarding perception fidelity, effectively addressing the perception-reasoning trade-off and improving performance across diverse tasks.
Contribution
It proposes a novel perception verification method and a structured verbal verification technique, integrated into a modality-aware credit assignment mechanism for better perception-reasoning synergy.
Findings
Improved perception and reasoning performance across multiple vision-language tasks.
Effective decoupling of perception and reasoning steps for targeted supervision.
Enhanced reward routing to address perception or reasoning errors specifically.
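The reward-routing idea in the findings above can be illustrated with a minimal sketch. This is not the authors' implementation: the `Step` type, the `route_rewards` function, the two boolean verifier outcomes, and the reward values are all hypothetical, chosen only to show how credit might be split between perception and reasoning steps once each has been checked separately.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Step:
    """One segment of a model response (hypothetical structure)."""
    kind: str  # either "perception" or "reasoning"
    text: str


def route_rewards(
    steps: List[Step],
    perception_ok: bool,
    answer_ok: bool,
    bonus: float = 1.0,
    penalty: float = -1.0,
) -> List[float]:
    """Assign one reward per step based on which modality is at fault.

    Assumption: perception_ok comes from some perception verifier and
    answer_ok from a final-answer check; both are supplied by the caller.
    """
    rewards = []
    for step in steps:
        if step.kind == "perception":
            # Perception steps are judged only on perception fidelity.
            rewards.append(bonus if perception_ok else penalty)
        elif step.kind == "reasoning":
            if answer_ok:
                rewards.append(bonus)
            elif perception_ok:
                # Seeing was fine, so the failure is "bad thinking".
                rewards.append(penalty)
            else:
                # Failure is attributed to "bad seeing"; reasoning is not blamed.
                rewards.append(0.0)
        else:
            raise ValueError(f"unknown step kind: {step.kind}")
    return rewards


steps = [Step("perception", "The chart shows three bars."),
         Step("reasoning", "So the total is 12.")]
print(route_rewards(steps, perception_ok=False, answer_ok=False))
```

In this sketch a wrong answer with faulty perception penalizes only the perception step, while a wrong answer with correct perception penalizes only the reasoning step.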
Abstract
Achieving robust perception-reasoning synergy is a central goal for advanced Vision-Language Models (VLMs). Recent work has pursued this goal via architectural designs or agentic workflows. However, these approaches are often limited by static textual reasoning or burdened by the significant compute and engineering cost of external agentic components. Worse, this heavy investment does not yield proportional gains and often produces a "seesaw effect" between perception and reasoning. This motivates a fundamental rethinking of the true bottleneck. In this paper, we argue that the root cause of this trade-off is an ambiguity in modality credit assignment: when a VLM fails, is it due to flawed perception ("bad seeing") or flawed logic ("bad thinking")? To resolve this, we introduce a reinforcement learning framework that improves perception-reasoning synergy by reliably rewarding…
