PRPO: Aligning Process Reward with Outcome Reward in Policy Optimization

Ruiyi Ding; Yongxuan Lv; Xianhui Meng; Jiahe Song; Chao Wang; Chen Jiang; Yuan Cheng

arXiv:2601.07182·cs.LG·February 4, 2026

Open Access

TL;DR

PRPO enhances policy optimization for large language models by integrating process-level guidance with outcome rewards, leading to improved reasoning accuracy without requiring a value network.

Contribution

The paper introduces a critic-free method that combines process reward model (PRM) scores with outcome rewards through normalization and distribution alignment.

Findings

01. PRPO improves Qwen2.5-Math-1.5B accuracy on MATH500 from 61.2% (GRPO) to 64.4%.

02. It achieves this with only eight rollouts and no value network.

03. PRPO demonstrates efficient fine-grained credit assignment in policy optimization.

Abstract

Policy optimization for large language models often suffers from sparse reward signals in multi-step reasoning tasks. Critic-free methods like GRPO assign a single normalized outcome reward to all tokens, providing limited guidance for intermediate reasoning. While Process Reward Models (PRMs) offer dense feedback, they risk premature collapse when used alone, as early low-reward tokens can drive policies toward truncated outputs. We introduce Process Relative Policy Optimization (PRPO), which combines outcome reliability with process-level guidance in a critic-free framework. PRPO segments reasoning sequences based on semantic clues, normalizes PRM scores into token-level advantages, and aligns their distribution with outcome advantages through a location-parameter shift. On MATH500, PRPO improves Qwen2.5-Math-1.5B accuracy from 61.2% to 64.4% over GRPO using only eight rollouts and no…
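The pipeline described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the function names, the choice to standardize PRM scores within each rollout, and the reading of the location-parameter shift as "add the rollout's outcome advantage to every normalized process advantage" are all assumptions made for illustration. Segment boundaries are taken as given inputs, since the abstract does not specify how the semantic segmentation is performed.

```python
from statistics import mean, pstdev
from typing import List, Tuple


def standardize(values: List[float], eps: float = 1e-8) -> List[float]:
    """Zero-mean, unit-scale normalization: (x - mean) / (std + eps)."""
    mu = mean(values)
    sigma = pstdev(values)
    return [(v - mu) / (sigma + eps) for v in values]


def prpo_style_token_advantages(
    outcome_rewards: List[float],
    segment_scores: List[List[float]],
    segment_lengths: List[List[int]],
) -> Tuple[List[float], List[List[float]]]:
    """Hypothetical sketch of combining outcome and process signals.

    outcome_rewards: one scalar reward per rollout in the group.
    segment_scores:  PRM score for each segment of each rollout.
    segment_lengths: token count of each segment of each rollout.
    """
    # GRPO-style step: one group-normalized outcome advantage per rollout.
    outcome_adv = standardize(outcome_rewards)

    token_adv: List[List[float]] = []
    for adv_out, scores, lengths in zip(outcome_adv, segment_scores, segment_lengths):
        # Assumption: PRM scores are standardized within the rollout,
        # so the segment-level process advantages are centered at zero.
        process_adv = standardize(scores)
        # Assumption: the location shift moves that center to the
        # rollout's outcome advantage.
        shifted = [a + adv_out for a in process_adv]
        # Broadcast each segment's advantage to all of its tokens.
        tokens: List[float] = []
        for a, n in zip(shifted, lengths):
            tokens.extend([a] * n)
        token_adv.append(tokens)
    return outcome_adv, token_adv


# Toy group of two rollouts: the first is correct, the second is not.
outcome_rewards = [1.0, 0.0]
segment_scores = [[0.9, 0.5], [0.2, 0.6, 0.4]]
segment_lengths = [[3, 2], [1, 2, 2]]

outcome_adv, token_adv = prpo_style_token_advantages(
    outcome_rewards, segment_scores, segment_lengths
)
for i, tokens in enumerate(token_adv):
    print(i, round(outcome_adv[i], 3), [round(t, 3) for t in tokens])
```

Under these assumptions, tokens in a rollout no longer share a single advantage as they would in GRPO: segments the PRM scores above the rollout's average receive more credit than the outcome advantage alone, and weaker segments receive less, while the average over segments stays anchored to the outcome signal.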

Taxonomy

Topics: Explainable Artificial Intelligence (XAI) · Machine Learning in Healthcare · Reinforcement Learning in Robotics