Perception-Aware Policy Optimization for Multimodal Reasoning
Zhenhailong Wang, Xuehang Guo, Sofia Stoica, Haiyang Xu, Hongru Wang, Hyeonjeong Ha, Xiusi Chen, Yangyi Chen, Ming Yan, Fei Huang, Heng Ji

TL;DR
PAPO is a novel reinforcement learning algorithm that enhances multimodal reasoning by integrating perception-aware supervision, leading to significant performance improvements and reduced perception errors in visual reasoning tasks.
Contribution
PAPO introduces the Implicit Perception Loss and the Double Entropy Loss to improve perception and reasoning in multimodal RL without extra data or models.
Findings
Achieves 4.4%-17.5% improvements on multimodal benchmarks.
Reduces perception errors by 30.5%.
Enhances performance on vision-dependent tasks by 8.0%-19.1%.
Abstract
Reinforcement Learning with Verifiable Rewards (RLVR) has proven to be a highly effective strategy for endowing Large Language Models (LLMs) with robust multi-step reasoning abilities. However, its design and optimizations remain tailored to purely textual domains, resulting in suboptimal performance when applied to multimodal reasoning tasks. In particular, we observe that a major source of error in current multimodal reasoning lies in the perception of visual inputs. To address this bottleneck, we propose PAPO, a novel policy gradient algorithm that encourages the model to learn to perceive while learning to reason. Specifically, we introduce the Implicit Perception Loss in the form of a KL divergence term, which can be seamlessly plugged into mainstream RLVR algorithms such as GRPO and DAPO. Notably, PAPO does not rely on additional data curation, reward models, or stronger teacher…
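To make the idea of a KL divergence term plugged into an RLVR objective more concrete, here is a minimal, hedged sketch in plain Python. It is not the authors' implementation. It assumes the KL is measured between the policy's next-token distributions computed with the original image and with a corrupted (for example, masked) image, and that the term is added to the base objective with a small weight. The function names, the `gamma` weight, and the toy logits are all hypothetical, and the Double Entropy Loss is not modeled here.

```python
import math

def softmax(logits):
    """Convert a list of logits into a probability distribution."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions over the same vocabulary."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

def implicit_perception_term(logits_full, logits_masked):
    """Mean per-token KL between the distributions obtained with the original
    image (logits_full) and with a corrupted image (logits_masked).
    Assumption: each argument is a list of per-token logit vectors."""
    kls = [
        kl_divergence(softmax(full), softmax(masked))
        for full, masked in zip(logits_full, logits_masked)
    ]
    return sum(kls) / len(kls)

def combined_objective(base_objective, kl_term, gamma=0.01):
    """Hypothetical combination: the base RLVR objective (e.g. from GRPO or DAPO)
    plus a weighted KL term, so that maximizing the result rewards outputs
    that depend on the visual input."""
    return base_objective + gamma * kl_term

# Toy example: two response tokens over a three-word vocabulary.
full = [[1.0, 2.0, 3.0], [0.5, 0.5, 0.5]]
masked = [[3.0, 2.0, 1.0], [0.5, 0.5, 0.5]]
kl = implicit_perception_term(full, masked)
objective = combined_objective(base_objective=1.0, kl_term=kl)
```

In this sketch, a response whose token distribution does not change when the image is corrupted gets a KL term of zero, while a response that depends on the image gets a positive term. How the term is weighted and regularized in practice is left to the paper.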
Code & Models
- 🤗 PAPOGalaxy/PAPO-G-H-Qwen2.5-VL-3B · model · 35 dl · ♡ 1
- 🤗 PAPOGalaxy/PAPO-G-H-Qwen2.5-VL-7B · model · 16 dl · ♡ 2
- 🤗 PAPOGalaxy/PAPO-D-Qwen2.5-VL-3B · model · 52 dl
- 🤗 PAPOGalaxy/PAPO-D-Qwen2.5-VL-7B · model · 368 dl
- 🤗 PAPO-Galaxy/PAPO-D-Qwen2.5-VL-7B · model · 3 dl · ♡ 1
- 🤗 PAPO-Galaxy/PAPO-G-H-Qwen2.5-VL-3B · model · 2 dl
- 🤗 PAPO-Galaxy/PAPO-G-H-Qwen2.5-VL-7B · model · 12 dl
- 🤗 xuehang/PAPO-D-Qwen2.5-VL-7B · model
- 🤗 xuehang/PAPO-D-Qwen2.5-VL-3B · model
- 🤗 xuehang/PAPO-G-H-Qwen2.5-VL-3B · model
