BAPO: Stabilizing Off-Policy Reinforcement Learning for LLMs via Balanced Policy Optimization with Adaptive Clipping
Zhiheng Xi, Xin Guo, Yang Nan, Enyu Zhou, Junrui Shen, Wenxiang Chen, Jiaqi Liu, Jixuan Huang, Zhihao Zhang, Honglin Guo, Xun Deng, Zhikai Lei, Miao Zheng, Guoteng Wang, Shuo Zhang, Peng Sun, Rui Zheng, Hang Yan, Tao Gui, Qi Zhang, Xuanjing Huang

TL;DR
This paper introduces BAPO, a novel adaptive clipping method for off-policy reinforcement learning in large language models, which stabilizes training, maintains entropy, and improves data efficiency and performance.
Contribution
The paper identifies key issues in off-policy RL for LLMs and proposes BAPO, a dynamic clipping technique that enhances stability and exploration during training.
Findings
BAPO outperforms existing open-source models on the AIME 2024 and AIME 2025 benchmarks.
BAPO achieves state-of-the-art results among models of similar scale.
BAPO demonstrates improved stability and data efficiency in diverse off-policy scenarios.
Abstract
Reinforcement learning (RL) has recently become the core paradigm for aligning and strengthening large language models (LLMs). Yet, applying RL in off-policy settings--where stale data from past policies are used for training--improves sample efficiency, but remains challenging: policy entropy declines sharply, optimization often becomes unstable and may even collapse. Through theoretical and empirical analysis, we identify two key insights: (i) an imbalance in optimization, where negative-advantage samples dominate the policy gradient, suppressing useful behaviors and risking gradient explosions; and (ii) the derived Entropy-Clip Rule, which reveals that the fixed clipping mechanism in PPO-like objectives systematically blocks entropy-increasing updates, thereby driving the policy toward over-exploitation at the expense of exploration. Building on these insights, we propose BAlanced…
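To make the clipping discussion in the abstract concrete, here is a minimal sketch of a generic PPO-style clipped surrogate with separate lower and upper clip bounds. It is not BAPO itself: the abstract is cut off before the method is described, so the adaptive rule is not reproduced. The names `clipped_surrogate`, `has_gradient`, `positive_share`, `clip_low`, and `clip_high` are hypothetical, and the "share" measure (summed |ratio × advantage| over unclipped tokens) is only an illustrative proxy for how much each sign of advantage contributes to the update.

```python
from typing import List, Sequence


def clipped_surrogate(ratios: Sequence[float], advantages: Sequence[float],
                      clip_low: float = 0.2, clip_high: float = 0.2) -> List[float]:
    """Per-token PPO-style surrogate: min(r*A, clip(r, 1-clip_low, 1+clip_high)*A)."""
    values = []
    for r, a in zip(ratios, advantages):
        r_clipped = min(max(r, 1.0 - clip_low), 1.0 + clip_high)
        values.append(min(r * a, r_clipped * a))
    return values


def has_gradient(ratio: float, advantage: float,
                 clip_low: float, clip_high: float) -> bool:
    """True when the unclipped term is selected, so the token still yields a gradient."""
    if advantage > 0:
        # Positive-advantage tokens are cut off once the ratio exceeds the upper bound.
        return ratio < 1.0 + clip_high
    if advantage < 0:
        # Negative-advantage tokens are cut off once the ratio falls below the lower bound.
        return ratio > 1.0 - clip_low
    return False


def positive_share(ratios: Sequence[float], advantages: Sequence[float],
                   clip_low: float, clip_high: float) -> float:
    """Illustrative proxy: fraction of unclipped |r*A| mass from positive-advantage tokens."""
    pos = 0.0
    neg = 0.0
    for r, a in zip(ratios, advantages):
        if not has_gradient(r, a, clip_low, clip_high):
            continue
        weight = abs(r * a)
        if a > 0:
            pos += weight
        else:
            neg += weight
    total = pos + neg
    return pos / total if total > 0 else 0.0


# Toy batch (made-up numbers): importance ratios and advantages for five tokens.
ratios = [0.7, 0.9, 1.1, 1.25, 1.5]
advantages = [-1.0, -1.0, 1.0, 1.0, -1.0]

symmetric = positive_share(ratios, advantages, clip_low=0.2, clip_high=0.2)
wider_upper = positive_share(ratios, advantages, clip_low=0.2, clip_high=0.3)
print(f"positive share, symmetric clip: {symmetric:.3f}")
print(f"positive share, wider upper clip: {wider_upper:.3f}")
```

In this toy batch, widening only the upper bound lets the token with ratio 1.25 and positive advantage back into the update, so the positive-advantage share rises. That is the kind of rebalancing an asymmetric or adaptive clip range can provide; how BAPO actually chooses its bounds is not stated in the text above.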
Peer Reviews
Decision: ICLR 2026 Poster
* This work provides a sensible and relevant analysis of issues commonly encountered in practice.
* The proposed method is reasonable, practical, and straightforward to implement.
* Strong model performance on AIME 2024 / 2025 benchmarks.
- **Limited evaluation scope.** Evaluation is restricted to AIME 2024/2025, each containing only 30 problems. Broader evaluation on more diverse and widely used benchmarks would strengthen the claims.
- **Attribution of improvements is unclear.** Comparisons are made with prior models trained under substantially different setups, notably in terms of training data. This paper lacks detail on the SFT stage (before RL), making it difficult to determine whether performance gains of the final mode
* The Entropy-Clip Rule is a good contribution.
* The results show strong improvement.
* Hyperparameter complexity.
* Evaluation task diversity is very limited (math only).
1. The empirical performance with an asymmetric clip constant is promising.
2. Some intuition-level justification, from both the policy-gradient contribution perspective and the entropy-dynamics perspective, is given to motivate the method.
3. The paper is quite well-written and easy to follow.
I'm somewhat skeptical of both perspectives on the training instability discussed in the paper:
1. From the policy gradient perspective, e.g., Eq. (3), in general RL, both positive and negative samples are important in estimating the value function gradient. Taking out either part will result in biased estimation and harm the performance improvement. On the other hand, if a random minibatch contains only negative samples, the policy can still be improved in expectation. So, the dominance of neg
Taxonomy
Topics: Topic Modeling · Machine Learning and Data Classification · Natural Language Processing Techniques
