Seeing with You: Perception-Reasoning Coevolution for Multimodal Reasoning

Ziqi Miao, Haonan Jia, Lijun Li, Chen Qian, Yuan Xiong, Wenting Yan, and Jing Shao

arXiv:2603.28618 · cs.AI · April 10, 2026

TL;DR

PRCO introduces a dual-role reinforcement learning framework for multimodal reasoning, improving visual evidence extraction and reasoning accuracy by coordinating an Observer and Solver with role-specific rewards.

Contribution

The paper proposes PRCO, a perception-reasoning coevolution approach that trains a single shared policy in two roles with role-specific rewards, with the aim of improving multimodal reasoning performance.

Findings

01 PRCO improves accuracy by over 7 points across benchmarks.

02 PRCO outperforms prior open-source RL-tuned baselines.

03 PRCO demonstrates consistent gains across model scales.

Abstract

Reinforcement learning with verifiable rewards (RLVR) has substantially enhanced the reasoning capabilities of multimodal large language models (MLLMs). However, existing RLVR approaches typically rely on outcome-driven optimization that updates both perception and reasoning using a shared reward based solely on the final answer. This shared reward blurs credit assignment, frequently improving reasoning patterns while failing to reliably enhance the accuracy of upstream visual evidence extraction. To address this perception bottleneck, we introduce PRCO (Perception-Reasoning Coevolution), a dual-role RLVR framework with a shared policy. PRCO consists of two cooperative roles: an Observer that generates an evidence caption tailored to the question and a Solver that predicts the final answer based on this caption. Crucially, PRCO employs role-specific reward signals: the Solver is…
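The two-role structure described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the `Policy` callable, the prompt templates, and the `dual_role_rollout` function are all hypothetical names, and the sketch assumes the Solver receives only the caption and the question (the abstract says only that the Solver answers "based on this caption"). The role-specific rewards are left out because the abstract is truncated before it defines them.

```python
from dataclasses import dataclass
from typing import Any, Callable, Optional

# Hypothetical interface: one shared policy maps (prompt, optional image) to text.
Policy = Callable[[str, Optional[Any]], str]

# Hypothetical prompt templates; the paper's actual prompts are not given here.
OBSERVER_PROMPT = "Describe the visual evidence needed to answer: {question}"
SOLVER_PROMPT = "Evidence: {caption}\nQuestion: {question}\nAnswer:"


@dataclass
class Rollout:
    caption: str  # Observer output: evidence caption tailored to the question
    answer: str   # Solver output: final answer predicted from the caption


def dual_role_rollout(policy: Policy, image: Any, question: str) -> Rollout:
    """Run the same policy twice, once per role."""
    # Observer role: looks at the image and writes a question-specific caption.
    caption = policy(OBSERVER_PROMPT.format(question=question), image)
    # Solver role: assumed here to see only the caption and question, no image.
    answer = policy(SOLVER_PROMPT.format(caption=caption, question=question), None)
    return Rollout(caption=caption, answer=answer)
```

Because both calls go through the same `policy`, any update driven by one role's reward also changes the other role's behavior, which is one plausible reading of "coevolution with a shared policy".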
