BNPO: Beta Normalization Policy Optimization
Changyi Xiao, Mengdi Zhang, Yixin Cao

TL;DR
BNPO introduces an adaptive reward normalization technique using a Beta distribution to improve the stability and performance of policy optimization in reinforcement learning for reasoning tasks.
Contribution
The paper proposes BNPO, a novel reward normalization method that adaptively aligns with policy updates, enhancing gradient estimation and training stability in RL.
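To make the idea of Beta-based reward normalization concrete, here is a minimal sketch for one prompt with binary rewards. It assumes the advantage takes the form `(r - p) / f(p; alpha, beta)`, where `p` is the group's empirical success rate and `f` is a Beta density. That form, the function names, and the `eps` clamp are illustrative assumptions, not details confirmed by this summary. How `alpha` and `beta` are updated during training is not specified here, so they are passed in as plain arguments.

```python
import math


def beta_pdf(p: float, alpha: float, beta: float) -> float:
    """Density of Beta(alpha, beta) at p, for 0 < p < 1."""
    log_norm = math.lgamma(alpha + beta) - math.lgamma(alpha) - math.lgamma(beta)
    return math.exp(
        log_norm + (alpha - 1) * math.log(p) + (beta - 1) * math.log(1 - p)
    )


def beta_normalized_advantages(rewards, alpha, beta, eps=1e-6):
    """Advantages for one prompt's group of binary (0/1) rewards.

    Assumed form (hypothetical, not taken from the paper text):
        A_i = (r_i - p) / f(p; alpha, beta)
    where p is the empirical success rate of the group.
    """
    p = sum(rewards) / len(rewards)
    # Clamp p away from 0 and 1 so the logarithms stay finite
    # when every sample in the group has the same reward.
    p = min(max(p, eps), 1 - eps)
    scale = beta_pdf(p, alpha, beta)
    return [(r - p) / scale for r in rewards]


if __name__ == "__main__":
    group = [1, 0, 0, 1]
    # alpha = beta = 1 gives a flat density, so this is plain mean-centering.
    print(beta_normalized_advantages(group, alpha=1.0, beta=1.0))
    # Other settings rescale the centered rewards by 1 / f(p).
    print(beta_normalized_advantages(group, alpha=1.5, beta=1.5))
```

With `alpha = beta = 1` the density is identically 1, so the function returns the mean-centered rewards unchanged; any other setting rescales them by a factor that depends on the success rate.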
Findings
BNPO achieves state-of-the-art results on mathematical reasoning benchmarks.
BNPO reduces gradient variance compared to methods that use no reward normalization or static reward normalization.
Theoretical analysis confirms BNPO's variance-reducing properties.
Abstract
Recent studies, including DeepSeek-R1 and Kimi-k1.5, have demonstrated that reinforcement learning with rule-based, binary-valued reward functions can significantly enhance the reasoning capabilities of large language models. These models primarily utilize REINFORCE-based policy optimization techniques, such as REINFORCE with baseline and group relative policy optimization (GRPO). However, a key limitation remains: current policy optimization methods either neglect reward normalization or employ static normalization strategies, which fail to adapt to the dynamic nature of policy updates during training. This may result in unstable gradient estimates and hinder training stability. To address this issue, we propose Beta Normalization Policy Optimization (BNPO), a novel policy optimization method that adaptively normalizes rewards using a Beta distribution with dynamically updated…
Peer Reviews
Decision: ICLR 2026 Conference Withdrawn Submission
The paper addresses a practical problem aimed at improving the stability of policy gradient by reducing gradient variance. Meanwhile, it provides an elegant theoretical derivation of the algorithm.
One of the baselines, ReMax, already provides a straightforward method for computing a baseline and stabilizes training, as shown in Figure 2. The proposed method requires estimating two parameters, which can introduce extra estimation bias and variance. The empirical testing is limited to four math benchmarks and Qwen base models only. Also, experimental analysis of the advantage decomposition mechanism is missing. Li, Z., Xu, T., Zhang, Y., Lin, Z., Yu, Y., Sun, R., & Luo, Z. Q.
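The concern about estimating two parameters can be illustrated with a small sketch. It assumes the Beta parameters are fit by the method of moments to per-prompt success rates, which is one standard estimator; the review text does not say which estimator BNPO uses. The names `fit_beta_moments` and `sample_fit`, and the ground-truth values `a=2`, `b=5`, are hypothetical.

```python
import random
import statistics


def fit_beta_moments(rates):
    """Method-of-moments fit of Beta(a, b) to values in (0, 1)."""
    m = statistics.fmean(rates)
    v = statistics.variance(rates)  # unbiased sample variance
    # A Beta distribution requires 0 < variance < mean * (1 - mean).
    if not 0 < v < m * (1 - m):
        raise ValueError("sample moments are not consistent with a Beta distribution")
    common = m * (1 - m) / v - 1
    return m * common, (1 - m) * common


def sample_fit(n, seed, a=2.0, b=5.0):
    """Draw n success rates from Beta(a, b) and refit them (illustrative only)."""
    rng = random.Random(seed)
    rates = [rng.betavariate(a, b) for _ in range(n)]
    return fit_beta_moments(rates)


if __name__ == "__main__":
    # Compare how much the fitted (a, b) scatter for small and large batches.
    for n in (8, 64, 4096):
        fits = [sample_fit(n, seed) for seed in range(5)]
        print(n, [(round(a, 2), round(b, 2)) for a, b in fits])
```

With only a handful of prompts per batch, the fitted values typically scatter widely around the true parameters, which is the kind of estimation noise the reviewer points to. In practice the success rates themselves would also be estimated from a few samples per prompt, adding further noise that this sketch does not model.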
1. BNPO introduces a dynamic reward normalization technique using the Beta distribution, offering a novel solution to reward normalization issues in reinforcement learning. This is particularly important as existing methods use static normalization, which doesn’t adapt as training progresses.
2. The paper provides a solid theoretical analysis showing that BNPO effectively reduces gradient variance and enhances the stability of training. The detailed proof and derivations offer a clear understand
1. While BNPO works well for binary-valued rewards, the extension to multi-valued or continuous rewards is mentioned but not thoroughly explored in the paper. More detailed analysis and experiments on this extension would strengthen the claim of BNPO's general applicability.
2. The adaptive nature of BNPO, while theoretically sound, may introduce additional computational overhead in estimating the parameters of the Beta distribution during training. The paper could discuss the trade-off between
1. The paper proposes a complete workflow for normalizing the reward for policy gradient with a Beta distribution, including estimating the distribution parameters.
2. It is interesting that with a different parameter setup, Beta normalization recovers REINFORCE and GRPO.
3. The proposed approach is empirically evaluated on multiple datasets and multiple models, with both pass rate and gradient norm.
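The observation that particular parameter settings recover REINFORCE and GRPO can be checked by hand under one assumption: that the advantage for a binary reward $R$ is $(R - p)/f(p;\alpha,\beta)$, where $p$ is the prompt's success probability and $f$ is a Beta density. This form and the specific parameter values below are an illustrative reading, not something this summary states.

```latex
% Assumed form: A = (R - p) / f(p; \alpha, \beta), with R \in \{0, 1\}
% and p the success probability of the prompt under the current policy.
\begin{align*}
f(p;\alpha,\beta) &= \frac{p^{\alpha-1}(1-p)^{\beta-1}}{B(\alpha,\beta)},
  \qquad B(\alpha,\beta) = \frac{\Gamma(\alpha)\,\Gamma(\beta)}{\Gamma(\alpha+\beta)} \\[4pt]
\alpha=\beta=1:\quad f(p;1,1) &= 1
  \;\Rightarrow\; A = R - p \\[4pt]
\alpha=\beta=\tfrac{3}{2}:\quad f\!\left(p;\tfrac{3}{2},\tfrac{3}{2}\right)
  &= \frac{8}{\pi}\sqrt{p(1-p)}
  \;\Rightarrow\; A = \frac{\pi}{8}\cdot\frac{R-p}{\sqrt{p(1-p)}}
\end{align*}
```

The first case is REINFORCE with a mean-reward baseline. The second matches GRPO's mean-and-standard-deviation normalization for binary rewards up to the constant $\pi/8$, since a Bernoulli($p$) reward has standard deviation $\sqrt{p(1-p)}$ and $B(\tfrac{3}{2},\tfrac{3}{2}) = \Gamma(\tfrac{3}{2})^2/\Gamma(3) = \pi/8$.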
The method is conceptually flawed in several respects, which limits its applicability to general RL problems.
1. The correctness of the gradient estimator is questionable. After the new normalization, the gradient estimator is no longer an unbiased estimate of the exact policy gradient. There is no conceptual justification that the new estimator would reach a solution close to the optimal policy of the original problem in general.
2. The theorem in this paper rests on an
Taxonomy
Topics: Reservoir Engineering and Simulation Methods · Monetary Policy and Economic Impact
