Self-Guided Process Reward Optimization with Redefined Step-wise Advantage for Process Reinforcement Learning

Wu Fei; Hao Kong; Shuxian Liang; Yang Lin; Yibo Yang; Jing Tang; Lei Chen; Xiansheng Hua

arXiv:2507.01551·cs.LG·July 4, 2025

3 Reviews

TL;DR

This paper introduces SPRO, a process reward optimization framework for Process Reinforcement Learning that derives intrinsic rewards and step-wise advantages, improving training efficiency and performance without extra computational costs.

Contribution

SPRO provides a unified theoretical framework for process-level advantage estimation and intrinsic reward derivation, enhancing efficiency and stability in process RL for LLMs.

Findings

01

SPRO achieves 3.4x higher training efficiency than vanilla GRPO.

02

SPRO improves test accuracy by 17.5%.

03

SPRO reduces response length by about one-third.

Abstract

Process Reinforcement Learning (PRL) has demonstrated considerable potential in enhancing the reasoning capabilities of Large Language Models (LLMs). However, introducing additional process reward models incurs substantial computational overhead, and there is no unified theoretical framework for process-level advantage estimation. To bridge this gap, we propose Self-Guided Process Reward Optimization (SPRO), a novel framework that enables process-aware RL through two key innovations: (1) we first theoretically demonstrate that process rewards can be derived intrinsically from the policy model itself, and (2) we introduce well-defined cumulative process rewards and Masked Step Advantage (MSA), which facilitates rigorous step-wise action advantage estimation within shared-prompt sampling groups. Our…
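As an illustration of point (1), the sketch below computes a cumulative process reward for one response as a running sum of scaled log-probability ratios between the policy and a reference model. This is a minimal sketch under assumptions, not the paper's implementation: the function name `cumulative_process_rewards`, the coefficient `beta`, and the toy log-probabilities are all hypothetical, and the paper's exact definition may include terms not shown here.

```python
from itertools import accumulate


def cumulative_process_rewards(policy_logprobs, ref_logprobs, beta=1.0):
    """Running sum of beta * (log pi_theta - log pi_ref) over one response.

    Assumption: both inputs are per-token log-probabilities of the same
    sampled tokens, one list entry per token.
    """
    if len(policy_logprobs) != len(ref_logprobs):
        raise ValueError("log-prob sequences must have equal length")
    # Per-token scaled log-ratio between policy and reference.
    ratios = [beta * (p - r) for p, r in zip(policy_logprobs, ref_logprobs)]
    # Prefix sums give one cumulative reward per token position.
    return list(accumulate(ratios))


# Toy log-probabilities for a 3-token response (made-up numbers).
policy_lp = [-1.0, -0.5, -2.0]
ref_lp = [-1.5, -0.5, -1.0]
cpr = cumulative_process_rewards(policy_lp, ref_lp, beta=0.5)
print(cpr)
```

Because only log-probabilities from the policy and the reference model are needed, no separate process reward model has to be queried in this sketch.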

Peer Reviews

Decision·Submitted to ICLR 2026

Reviewer 01 · Rating 8 · Confidence 4

Strengths

1. The paper clearly identifies the computational inefficiency of PRM-based methods and proposes a conceptually elegant self-guided alternative grounded in theory (policy-as-Q-function perspective). 2. The derivation connecting token-level MDPs, policy logits, and process rewards (Eq. 1–5) is rigorous and builds well on prior implicit reward literature (e.g., DPO/PRIME). 3. The presentation of CPR + MSA, especially Fig. 2–3, offers an intuitive and well-structured comparison with GRPO/PRIME. SPR

Weaknesses

1. Results are restricted to 7B-scale models; claims of scalability to larger reasoning models are not empirically validated. 2. While the “policy-as-reward-model” proposition is appealing, it risks circular reasoning: reward quality depends on policy quality, which itself evolves via those rewards. 3. Baselines mainly include GRPO and PRIME; the absence of other RL methods (e.g., Rest-MCTS*, Reinforce++, DAPO) weakens generality claims.

Reviewer 02 · Rating 2 · Confidence 3

Strengths

SPRO avoids an auxiliary PRM and computes process feedback directly from the policy and a fixed reference model, preserving a dual-model training footprint. MSA enforces per-step comparisons within a prompt’s rollouts, which directly targets the length-bias problem common in outcome-only grouping. The construction and “masked mean” baseline are clearly spelled out. Simulation gains include shorter trajectories, higher accuracy, and better entropy than PRIME/GRPO under their setup, with tables/
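The per-step comparison with a "masked mean" baseline mentioned above can be sketched as follows. This is an illustrative reading of the reviewer's description, not the paper's code: the function name `masked_step_advantage` and the toy numbers are hypothetical, and the sketch assumes each rollout is represented by a list of cumulative rewards, one per step, with rollouts of different lengths.

```python
def masked_step_advantage(group_cpr):
    """Subtract a per-step masked mean from each rollout's cumulative rewards.

    group_cpr: list of lists, one inner list per rollout sampled from the
    same prompt. Rollouts may have different lengths; at step t the baseline
    averages only over rollouts that are still active (length > t).
    """
    if not group_cpr:
        return []
    max_len = max(len(r) for r in group_cpr)
    baselines = []
    for t in range(max_len):
        # The "mask": rollouts that ended before step t are excluded.
        active = [r[t] for r in group_cpr if len(r) > t]
        baselines.append(sum(active) / len(active))
    return [[v - baselines[t] for t, v in enumerate(r)] for r in group_cpr]


# Three rollouts for one prompt with lengths 3, 2, and 1 (made-up values).
group = [[1.0, 2.0, 3.0], [3.0, 4.0], [2.0]]
msa = masked_step_advantage(group)
print(msa)
```

Averaging only over rollouts that reach a given step means a long rollout is compared with peers at the same step rather than with padded positions, which is one way such a baseline could address length bias.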

Weaknesses

The key identity used to define CPR (Eq. 2–5) is derived by combining the max-entropy fixed-point and a Bellman-style relation that holds for the optimal policy/value. In practice, SPRO replaces $\pi^*$ with the current $\pi_\theta$ to compute the log-ratio sum. The paper does not quantify the bias introduced by this substitution nor provide a bound that links CPR to true per-step advantage under model mismatch. This is central because CPR then drives the update. Proposition 1 asserts that “an
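For orientation, the substitution the reviewer describes can be written as below. This is a sketch under the usual implicit-reward assumptions; the coefficient $\beta$, the reference policy $\pi_{\mathrm{ref}}$, and the symbols $R_t^{*}$ and $\widehat{R}_t$ are notation introduced here for illustration, and the paper's Eq. 2–5 may differ in form.

```latex
% Log-ratio sum under the optimal policy \pi^*, for which the derivation holds:
R_t^{*} = \beta \sum_{j=0}^{t} \log \frac{\pi^{*}(a_j \mid s_j)}{\pi_{\mathrm{ref}}(a_j \mid s_j)}

% Quantity actually computable during training, with \pi^* replaced by \pi_\theta:
\widehat{R}_t = \beta \sum_{j=0}^{t} \log \frac{\pi_\theta(a_j \mid s_j)}{\pi_{\mathrm{ref}}(a_j \mid s_j)}

% The unquantified bias is the gap between the two:
\widehat{R}_t - R_t^{*} = \beta \sum_{j=0}^{t} \log \frac{\pi_\theta(a_j \mid s_j)}{\pi^{*}(a_j \mid s_j)}
```

The last line follows by subtracting the first expression from the second; the reference-policy terms cancel, so the gap depends only on how far $\pi_\theta$ is from $\pi^{*}$ along the sampled trajectory.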

Reviewer 03 · Rating 6 · Confidence 2

Strengths

1. The paper clearly articulates its contributions and presents a coherent logical progression. It first introduces the CPR concept, builds upon it to propose the MSA module, and finally develops the overall SPRO algorithmic framework. The overall structure is well-organized, and the presentation is smooth and easy to follow. 2. The experimental design convincingly demonstrates that the proposed SPRO algorithm outperforms prior baselines in terms of both accuracy and training efficiency. 3.

Weaknesses

1. One of the main contributions of this paper is the introduction of the self-guided reward (CPR), which is theoretically formulated as a general reward mechanism that appears not to be constrained by reasoning tasks or step-wise processes. However, the experiments only evaluate CPR within reasoning-oriented benchmarks. It would significantly strengthen the paper to include results demonstrating CPR’s generality—specifically, whether it can serve as a replacement for reward models in other comm
