BAPO: Stabilizing Off-Policy Reinforcement Learning for LLMs via Balanced Policy Optimization with Adaptive Clipping

Zhiheng Xi; Xin Guo; Yang Nan; Enyu Zhou; Junrui Shen; Wenxiang Chen; Jiaqi Liu; Jixuan Huang; Zhihao Zhang; Honglin Guo; Xun Deng; Zhikai Lei; Miao Zheng; Guoteng Wang; Shuo Zhang; Peng Sun; Rui Zheng; Hang Yan; Tao Gui; Qi Zhang; Xuanjing Huang

arXiv:2510.18927·cs.LG·October 23, 2025

Open Access · 3 Reviews

TL;DR

This paper introduces BAPO, a novel adaptive clipping method for off-policy reinforcement learning in large language models, which stabilizes training, maintains entropy, and improves data efficiency and performance.

Contribution

The paper identifies key issues in off-policy RL for LLMs and proposes BAPO, a dynamic clipping technique that enhances stability and exploration during training.

Findings

01

BAPO outperforms existing open-source models on AIME benchmarks.

02

BAPO achieves state-of-the-art results among models of similar scale.

03

BAPO demonstrates improved stability and data efficiency in diverse off-policy scenarios.

Abstract

Reinforcement learning (RL) has recently become the core paradigm for aligning and strengthening large language models (LLMs). Applying RL in off-policy settings, where stale data from past policies are used for training, improves sample efficiency but remains challenging: policy entropy declines sharply, and optimization often becomes unstable and may even collapse. Through theoretical and empirical analysis, we identify two key insights: (i) an imbalance in optimization, where negative-advantage samples dominate the policy gradient, suppressing useful behaviors and risking gradient explosions; and (ii) the derived Entropy-Clip Rule, which reveals that the fixed clipping mechanism in PPO-like objectives systematically blocks entropy-increasing updates, thereby driving the policy toward over-exploitation at the expense of exploration. Building on these insights, we propose BAlanced…
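The two observations in the abstract can be made concrete with a small sketch of a generic PPO-style clipped surrogate. This is not the paper's algorithm (the abstract excerpt above is cut off before the method is described); it only illustrates how a fixed clip range decides which tokens still receive a gradient, and how that can tilt the update toward negative-advantage samples. The function names, the toy ratios and advantages, and the clip values 0.8/1.2/1.3 are all illustrative assumptions.

```python
def clipped_surrogate(ratio, advantage, c_low=0.8, c_high=1.2):
    """Per-token PPO-style objective: min(r * A, clip(r, c_low, c_high) * A)."""
    clipped_ratio = max(c_low, min(ratio, c_high))
    return min(ratio * advantage, clipped_ratio * advantage)


def receives_gradient(ratio, advantage, c_low=0.8, c_high=1.2):
    """True when the unclipped term r * A is the active branch of the min."""
    if advantage > 0:
        # Positive-advantage tokens are cut off once the ratio exceeds c_high.
        return ratio <= c_high
    if advantage < 0:
        # Negative-advantage tokens are cut off once the ratio drops below c_low.
        return ratio >= c_low
    return False


def positive_share(ratios, advantages, c_low=0.8, c_high=1.2):
    """Share of the total gradient weight |r * A| (over unclipped tokens)
    that comes from positive-advantage tokens."""
    pos = neg = 0.0
    for r, a in zip(ratios, advantages):
        if not receives_gradient(r, a, c_low, c_high):
            continue
        if a > 0:
            pos += abs(r * a)
        else:
            neg += abs(r * a)
    total = pos + neg
    return pos / total if total > 0 else 0.0


# Hypothetical off-policy batch: importance ratios and advantages per token.
ratios = [1.30, 1.25, 1.00, 0.90, 1.10, 1.40]
advantages = [1.0, 1.0, 1.0, -1.0, -1.0, -1.0]

share_fixed = positive_share(ratios, advantages, c_low=0.8, c_high=1.2)
share_wider = positive_share(ratios, advantages, c_low=0.8, c_high=1.3)
print(f"positive share, c_high=1.2: {share_fixed:.3f}")
print(f"positive share, c_high=1.3: {share_wider:.3f}")
```

In this toy batch, the symmetric range clips two of the three positive-advantage tokens while every negative-advantage token stays active, so widening only the upper bound raises the positive share. How the bounds should actually be chosen is the question the paper addresses; the sketch makes no claim about that.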

Peer Reviews

Decision · ICLR 2026 Poster

Reviewer 01 · Rating 4 · Confidence 4

Strengths

* This work provides a sensible and relevant analysis of issues commonly encountered in practice.
* The proposed method is reasonable, practical, and straightforward to implement.
* Strong model performance on AIME 2024 / 2025 benchmarks.

Weaknesses

- **Limited evaluation scope.** Evaluation is restricted to AIME 2024/2025, each containing only 30 problems. Broader evaluation on more diverse and widely used benchmarks would strengthen the claims.
- **Attribution of improvements is unclear.** Comparisons are made with prior models trained under substantially different setups, notably in terms of training data. This paper lacks detail on the SFT stage (before RL), making it difficult to determine whether performance gains of the final mode

Reviewer 02 · Rating 6 · Confidence 4

Strengths

* The Entropy-Clip Rule is a good contribution.
* The results show strong improvement.

Weaknesses

* Hyperparameter complexity.
* Evaluation task diversity is very limited (only math).

Reviewer 03 · Rating 4 · Confidence 4

Strengths

1. The empirical performance with an asymmetric clip constant is promising.
2. Some intuition-level justification from the policy gradient contribution perspective and the entropy dynamic perspective is given to pave the motivation.
3. The paper is quite well-written and easy to follow.

Weaknesses

I'm somewhat skeptical of both perspectives on the training instability discussed in the paper:

1. From the policy gradient perspective, e.g., Eq. (3), in general RL, both positive and negative samples are important in estimating the value function gradient. Taking out either part will result in biased estimation and harm the performance improvement. On the other hand, if a random minibatch contains only negative samples, the policy can still be improved in expectation. So, the dominance of neg

Taxonomy

Topics: Topic Modeling · Machine Learning and Data Classification · Natural Language Processing Techniques