Perception-Aware Policy Optimization for Multimodal Reasoning

Zhenhailong Wang; Xuehang Guo; Sofia Stoica; Haiyang Xu; Hongru Wang; Hyeonjeong Ha; Xiusi Chen; Yangyi Chen; Ming Yan; Fei Huang; Heng Ji

arXiv:2507.06448 · cs.CL · April 15, 2026

TL;DR

PAPO is a policy gradient algorithm that adds perception-aware supervision to reinforcement learning for multimodal reasoning. It improves results on multimodal benchmarks by 4.4%-17.5% and reduces perception errors by 30.5%.

Contribution

PAPO introduces the Implicit Perception Loss, a KL divergence term, and the Double Entropy Loss to improve both perception and reasoning in multimodal RL, without additional data curation or reward models.

Findings

01 Achieves 4.4%-17.5% improvements on multimodal benchmarks.

02 Reduces perception errors by 30.5%.

03 Enhances performance on vision-dependent tasks by 8.0%-19.1%.

Abstract

Reinforcement Learning with Verifiable Rewards (RLVR) has proven to be a highly effective strategy for endowing Large Language Models (LLMs) with robust multi-step reasoning abilities. However, its design and optimizations remain tailored to purely textual domains, resulting in suboptimal performance when applied to multimodal reasoning tasks. In particular, we observe that a major source of error in current multimodal reasoning lies in the perception of visual inputs. To address this bottleneck, we propose PAPO, a novel policy gradient algorithm that encourages the model to learn to perceive while learning to reason. Specifically, we introduce the Implicit Perception Loss in the form of a KL divergence term, which can be seamlessly plugged into mainstream RLVR algorithms such as GRPO and DAPO. Notably, PAPO does not rely on additional data curation, reward models, or stronger teacher…
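
The abstract describes the Implicit Perception Loss only as a KL divergence term that plugs into GRPO-style objectives. The sketch below is one plausible reading, not the authors' implementation. It assumes the KL term compares the policy's next-token distributions with the full visual input against those with a corrupted (for example, masked) visual input, and that the term is added to a maximized policy objective with a small weight. The function names, the weight gamma, the sign convention, and the masking setup are all illustrative assumptions.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(
        pi * (math.log(pi + eps) - math.log(qi + eps)) for pi, qi in zip(p, q)
    )


def implicit_perception_term(logits_full, logits_masked):
    """Mean per-token KL between the policy given the full image and the
    policy given a corrupted image (ASSUMED setup, see lead-in)."""
    kls = [
        kl_divergence(softmax(a), softmax(b))
        for a, b in zip(logits_full, logits_masked)
    ]
    return sum(kls) / len(kls)


def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style advantages: rewards normalized within one sampled group."""
    mean = sum(rewards) / len(rewards)
    std = math.sqrt(sum((r - mean) ** 2 for r in rewards) / len(rewards))
    return [(r - mean) / (std + eps) for r in rewards]


def combined_objective(policy_objective, kl_term, gamma=0.01):
    """Objective to maximize: base policy objective plus a weighted KL term.
    gamma and the plus sign are ASSUMPTIONS, not values from the paper."""
    return policy_objective + gamma * kl_term
```

Under this reading, a response whose token distribution barely changes when the image is corrupted contributes a KL term near zero, so the added term rewards outputs that actually depend on the visual input.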

Videos

Perception-Aware Policy Optimization for Multimodal Reasoning · SlidesLive