BNPO: Beta Normalization Policy Optimization

Changyi Xiao; Mengdi Zhang; Yixin Cao

arXiv:2506.02864·cs.LG·June 4, 2025

Open Access 3 Reviews

TL;DR

BNPO introduces an adaptive reward normalization technique using a Beta distribution to improve the stability and performance of policy optimization in reinforcement learning for reasoning tasks.

Contribution

The paper proposes BNPO, a novel reward normalization method that adaptively aligns with policy updates, enhancing gradient estimation and training stability in RL.

Findings

01 BNPO achieves state-of-the-art results on reasoning tasks.

02 BNPO reduces gradient variance compared to traditional methods.

03 Theoretical analysis confirms BNPO's variance-reducing properties.

Abstract

Recent studies, including DeepSeek-R1 and Kimi-k1.5, have demonstrated that reinforcement learning with rule-based, binary-valued reward functions can significantly enhance the reasoning capabilities of large language models. These models primarily utilize REINFORCE-based policy optimization techniques, such as REINFORCE with baseline and group relative policy optimization (GRPO). However, a key limitation remains: current policy optimization methods either neglect reward normalization or employ static normalization strategies, which fail to adapt to the dynamic nature of policy updates during training. This may result in unstable gradient estimates and hinder training stability. To address this issue, we propose Beta Normalization Policy Optimization (BNPO), a novel policy optimization method that adaptively normalizes rewards using a Beta distribution with dynamically updated…
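To make the distinction in the abstract concrete, the following minimal sketch contrasts the two baselines it names on one group of binary rewards for a single prompt: REINFORCE with a mean baseline (no rescaling) and GRPO-style normalization (a rescaling by the group's standard deviation). The function names, the `eps` constant, and the use of the population standard deviation are illustrative assumptions, not details taken from the paper.

```python
from statistics import fmean, pstdev


def baseline_advantages(rewards):
    """REINFORCE with a mean baseline: subtract the group mean, no rescaling."""
    mean = fmean(rewards)
    return [r - mean for r in rewards]


def grpo_advantages(rewards, eps=1e-6):
    """GRPO-style normalization: subtract the group mean, divide by the group std.

    Assumption: population std plus a small eps to avoid division by zero.
    """
    mean = fmean(rewards)
    std = pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


# One prompt, four sampled responses, binary (0/1) rule-based rewards.
rewards = [1, 0, 0, 0]
print(baseline_advantages(rewards))
print(grpo_advantages(rewards))
```

For binary rewards with success rate p, the group standard deviation is sqrt(p(1 - p)), so the GRPO scaling is a fixed function of p. This is the sense in which such normalization can be called static.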

Peer Reviews

Decision·ICLR 2026 Conference Withdrawn Submission

Reviewer 01 · Rating 4 · Confidence 3

Strengths

The paper addresses a practical problem aimed at improving the stability of policy gradient by reducing gradient variance. Meanwhile, it provides an elegant theoretical derivation of the algorithm.

Weaknesses

One of the baselines, ReMax, already provides a straightforward method for computing a baseline and stabilizes the training as shown in Figure 2. The proposed method requires the estimation of two parameters, which can introduce extra estimation biases and variances. The empirical testing is limited to four MATH benchmarks and Qwen base models only. Also, experimental analysis on the advantage decomposition mechanism is missing.

Li, Z., Xu, T., Zhang, Y., Lin, Z., Yu, Y., Sun, R., & Luo, Z. Q.

Reviewer 02 · Rating 6 · Confidence 2

Strengths

1. BNPO introduces a dynamic reward normalization technique using the Beta distribution, offering a novel solution to reward normalization issues in reinforcement learning. This is particularly important as existing methods use static normalization, which doesn’t adapt as training progresses.
2. The paper provides a solid theoretical analysis showing that BNPO effectively reduces gradient variance and enhances the stability of training. The detailed proof and derivations offer a clear understand

Weaknesses

1. While BNPO works well for binary-valued rewards, the extension to multi-valued or continuous rewards is mentioned but not thoroughly explored in the paper. More detailed analysis and experiments on this extension would strengthen the claim of BNPO's general applicability.
2. The adaptive nature of BNPO, while theoretically sound, may introduce additional computational overhead in estimating the parameters of the Beta distribution during training. The paper could discuss the trade-off between

Reviewer 03 · Rating 2 · Confidence 4

Strengths

1. The paper proposes a complete workflow for normalizing the reward for policy gradient with a Beta distribution, including estimating the distribution parameters.
2. It is interesting that with a different parameter setup, Beta normalization recovers REINFORCE and GRPO.
3. The proposed approach is empirically evaluated on multiple datasets and multiple models, with both pass rate and gradient norm.
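The second strength above can be checked numerically under one assumption that this page does not spell out: that the normalized advantage for a binary reward R with success probability p has the form (R - p) / f(p; alpha, beta), where f is the Beta density. Under that assumed form, Beta(1, 1) has density 1 and leaves the REINFORCE-with-baseline advantage unchanged, while Beta(3/2, 3/2) has density proportional to sqrt(p(1 - p)), which matches the GRPO scaling for binary rewards up to a constant. The sketch also includes the standard method-of-moments fit for a Beta distribution as one possible way to estimate its two parameters; it is not necessarily the procedure the paper uses, and all function names here are hypothetical.

```python
import math


def beta_pdf(p, alpha, beta):
    """Density of Beta(alpha, beta) at p, for 0 < p < 1."""
    log_norm = math.lgamma(alpha + beta) - math.lgamma(alpha) - math.lgamma(beta)
    return math.exp(
        log_norm + (alpha - 1) * math.log(p) + (beta - 1) * math.log(1 - p)
    )


def beta_normalized_advantage(reward, p, alpha, beta):
    """Assumed form: centered reward divided by the Beta density at p."""
    return (reward - p) / beta_pdf(p, alpha, beta)


def fit_beta_moments(samples):
    """Standard method-of-moments estimate of (alpha, beta).

    Requires 0 < variance < mean * (1 - mean).
    """
    n = len(samples)
    mean = sum(samples) / n
    var = sum((x - mean) ** 2 for x in samples) / n
    common = mean * (1 - mean) / var - 1
    return mean * common, (1 - mean) * common


p = 0.25
# Beta(1, 1): density is 1, so the advantage equals R - p (REINFORCE with baseline).
print(beta_normalized_advantage(1, p, 1.0, 1.0))
# Beta(3/2, 3/2): density / sqrt(p(1 - p)) is the same constant for every p (GRPO-like).
print(beta_pdf(p, 1.5, 1.5) / math.sqrt(p * (1 - p)))
# Fitting a Beta distribution to per-prompt success rates (illustrative values).
print(fit_beta_moments([0.2, 0.4, 0.6, 0.8]))
```

The constant in the second case is 1 / B(3/2, 3/2) = 8 / pi, so under this assumed form the GRPO-like setting differs from GRPO only by a fixed rescaling of the advantage.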

Weaknesses

The method is conceptually flawed in several respects, which limits its applicability to general RL problems.
1. The correctness of the gradient estimator is questionable. After the new normalization, the gradient estimator is no longer an unbiased estimate of the exact policy gradient. There is no conceptual justification that the new estimator would reach a solution close to the optimal policy of the original problem in general.
2. The theorem in this paper rests on an
