On-Policy RL with Optimal Reward Baseline
Yaru Hao, Li Dong, Xun Wu, Shaohan Huang, Zewen Chi, Furu Wei

TL;DR
OPO is a new on-policy reinforcement learning algorithm that stabilizes training, improves exploration, and enhances diversity in large language models without auxiliary models or regularization.
Contribution
It introduces an exact on-policy training method with an optimal reward baseline to reduce variance and improve stability in reinforcement learning for language models.
Findings
OPO outperforms existing methods on mathematical reasoning benchmarks.
It achieves more stable training and higher output diversity.
OPO reduces policy shifts and enhances exploration.
Abstract
Reinforcement learning algorithms are fundamental to aligning large language models with human preferences and to enhancing their reasoning capabilities. However, current reinforcement learning algorithms often suffer from training instability due to loose on-policy constraints and from computational inefficiency due to auxiliary models. In this work, we propose On-Policy RL with Optimal reward baseline (OPO), a novel and simplified reinforcement learning algorithm designed to address these challenges. OPO emphasizes the importance of exact on-policy training, which empirically stabilizes the training process and enhances exploration. Moreover, OPO integrates a practically feasible formulation of the optimal reward baseline that minimizes gradient variance. We evaluate OPO on mathematical reasoning benchmarks. The results demonstrate its superior performance and training stability without…
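To make the "optimal reward baseline" concrete, here is a minimal Python sketch, not the authors' implementation. It assumes the baseline is a weighted mean of the rewards of several responses sampled for one prompt, with weights standing in for each response's squared gradient norm. Using response length as that weight is an assumption, motivated by the simplification the reviews attribute to the derivation around Eq. 10. The function names `weighted_baseline`, `advantages`, and `weighted_second_moment` are hypothetical.

```python
import numpy as np

def weighted_baseline(rewards, weights):
    """Weighted mean of rewards: sum(w_i * r_i) / sum(w_i).

    `weights` is a stand-in for the squared norm of each sample's
    score function; using response length here is an assumption.
    """
    rewards = np.asarray(rewards, dtype=float)
    weights = np.asarray(weights, dtype=float)
    return float(np.sum(weights * rewards) / np.sum(weights))

def advantages(rewards, lengths):
    """Reward minus the length-weighted baseline, one value per response."""
    baseline = weighted_baseline(rewards, lengths)
    return np.asarray(rewards, dtype=float) - baseline

def weighted_second_moment(rewards, weights, baseline):
    """Proxy for gradient variance: weighted mean of (r_i - b)^2."""
    rewards = np.asarray(rewards, dtype=float)
    weights = np.asarray(weights, dtype=float)
    return float(np.sum(weights * (rewards - baseline) ** 2) / np.sum(weights))

# Toy group of four responses to one prompt (made-up numbers).
rewards = [1.0, 0.0, 1.0, 0.0]
lengths = [10, 30, 20, 40]

b_weighted = weighted_baseline(rewards, lengths)  # (10 + 20) / 100 = 0.3
b_mean = float(np.mean(rewards))                  # plain mean = 0.5
adv = advantages(rewards, lengths)
```

In this toy group the weighted baseline (0.3) differs from the plain mean reward (0.5) because the longer responses received lower rewards. The weighted second moment is smaller at the weighted baseline than at the plain mean, which is the sense in which such a baseline reduces variance under the stated assumption.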
Peer Reviews
Decision: ICLR 2026 Conference Withdrawn Submission
1. The paper is well motivated and addresses an important research problem: RL algorithms often suffer from training instability due to loose on-policy constraints. 2. The paper is well written, and the derivation of the proposed OPO algorithm is easy to follow and understand. 3. The optimal reward baseline is simple and straightforward to implement, making it easy to incorporate into various RL algorithms. 4. The empirical results seem convincing: experiments on math benchmarks show that OPO achieves
1. The evaluation is limited to math benchmarks only; it would be nice to include results on code, science, STEM, or instruction-following benchmarks as well. 2. A key assumption in the derivation of OPO is that the gradients of different tokens are approximately orthogonal and that the gradient norms follow the same distribution. There is a lack of explanation and evidence for why this assumption would hold. 3. While the authors demonstrate the improvement brought by the optimal reward baseline
I found the paper to be lacking in technical novelty. It is essentially a rebranding of various previous recipes in a new form. However, I found the interest in variance reduction techniques in LLM finetuning to be quite refreshing.
A substantial literature exists on variance reduction techniques. Although they are good for theoretical results, incorporating baselines rarely changes anything. This is somewhat reflected in the training dynamics shown in Figure 1. Honestly, I am a bit surprised to see that both on-policy and off-policy training have similar error plots, but I guess that's OK. I might be biased in my opinion, but I think that variance reduction techniques bring little value in RL, and thi
1. The design of the reward based on minimizing gradient variance is highly innovative. 2. OPO achieves relatively stable training curves without requiring KL penalty or entropy regularization.
1. The theoretical assumptions in the paper—that "the gradients of different tokens are approximately orthogonal" and "the norm of the gradient for each token follows the same distribution"—lack justification. There is neither empirical statistical validation nor theoretical support, casting doubt on their reliability. 2. A significant portion of current RL algorithms are already based on on-policy methods. If the authors aim to emphasize the importance of on-policy learning, it would be more ap
1. The algorithm removes the need for auxiliary models (like value networks) and explicit regularization terms (such as KL divergence or entropy bonuses), simplifying the training pipeline. 2. Experimental results show that OPO achieves improved performance on mathematical reasoning benchmarks and maintains stable training dynamics with lower policy shift and higher output entropy.
1. The formula for the optimal baseline, which weights the reward by the squared magnitude of the score function, is a very well-known result in variance reduction for policy gradient methods (REINFORCE) [1]. This considerably undermines the originality of this work. 2. Around Eq. 10, the authors make two strong and unjustified assumptions: one is the orthogonal token gradients, and the other is the same distribution of different squared token gradient norms. The first assumption hardly holds i
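For readers without the paper at hand, the classical result and the two assumptions the reviewers question can be written out as follows. This is a reconstruction from the review text, not the paper's own Eq. 10. The symbols $g_t$, $c$, and $|y|$ are notation introduced here for illustration.

```latex
% Per-token score function; the sequence score is their sum.
g_t = \nabla_\theta \log \pi_\theta(y_t \mid x, y_{<t}), \qquad
\nabla_\theta \log \pi_\theta(y \mid x) = \sum_{t=1}^{|y|} g_t

% Classical variance-minimizing baseline for REINFORCE:
% rewards weighted by the squared norm of the score function.
b^{*} = \frac{\mathbb{E}_{y}\!\left[\lVert \nabla_\theta \log \pi_\theta(y \mid x) \rVert^{2}\, R(x, y)\right]}
             {\mathbb{E}_{y}\!\left[\lVert \nabla_\theta \log \pi_\theta(y \mid x) \rVert^{2}\right]}

% Assumption 1 (approximately orthogonal token gradients):
% the cross terms g_s^\top g_t, s \neq t, are dropped.
\Bigl\lVert \sum_{t=1}^{|y|} g_t \Bigr\rVert^{2}
  = \sum_{t=1}^{|y|} \lVert g_t \rVert^{2} + \sum_{s \neq t} g_s^{\top} g_t
  \approx \sum_{t=1}^{|y|} \lVert g_t \rVert^{2}

% Assumption 2 (token gradient norms identically distributed,
% with \mathbb{E}\lVert g_t \rVert^{2} = c for every t):
\sum_{t=1}^{|y|} \lVert g_t \rVert^{2} \approx c\,|y|
\quad\Longrightarrow\quad
b^{*} \approx \frac{\mathbb{E}_{y}\!\left[\,|y|\, R(x, y)\right]}{\mathbb{E}_{y}\!\left[\,|y|\,\right]}
```

Under both assumptions the constant $c$ cancels and the baseline reduces to a length-weighted average of rewards. If either assumption fails, as the reviewers suggest it may, the last approximation no longer follows from the classical formula.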
Taxonomy
Topics: Topic Modeling · Reinforcement Learning in Robotics · Multimodal Machine Learning Applications
Methods: ALIGN
