Buffer Matters: Unleashing the Power of Off-Policy Reinforcement Learning in Large Language Model Reasoning

Xu Wan; Yansheng Wang; Wenqi Huang; Mingyang Sun

arXiv:2602.20722·cs.AI·March 17, 2026

Open Access · 3 Reviews

TL;DR

This paper introduces BAPO, an off-policy reinforcement learning framework that enhances data efficiency and problem-solving ability in large language models by selectively reusing difficult and high-quality samples.

Contribution

BAPO is a novel off-policy RLVR method that dynamically re-evaluates and reuses training samples, improving learning efficiency and problem-solving in large language models.
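
To make the idea of "re-evaluating and reusing" samples concrete, here is a minimal sketch of how a buffer-backed batch could be assembled. It is not the paper's algorithm: the `Record` type, the `build_batch` function, the fractions, and the accuracy thresholds are all hypothetical choices made for illustration, and "high quality" is assumed here to mean a stored prompt with a mixed but mostly correct set of rollouts.

```python
import random
from dataclasses import dataclass


@dataclass
class Record:
    """Hypothetical buffer entry: a prompt and its last observed rollout accuracy."""
    prompt: str
    accuracy: float  # fraction of correct rollouts the last time this prompt was sampled


def build_batch(fresh, buffer, batch_size, hard_frac=0.25, reuse_frac=0.25,
                hard_max=0.0, good_min=0.5, rng=None):
    """Mix fresh prompts with buffered ones (all thresholds are illustrative assumptions).

    - "re_rollout": historically difficult prompts (accuracy <= hard_max) to re-evaluate
    - "reuse": stored prompts with informative rollouts (good_min <= accuracy < 1)
    - "fresh": new prompts that fill the rest of the batch
    """
    rng = rng or random.Random(0)
    hard = [r for r in buffer if r.accuracy <= hard_max]
    good = [r for r in buffer if good_min <= r.accuracy < 1.0]
    n_hard = min(len(hard), int(batch_size * hard_frac))
    n_good = min(len(good), int(batch_size * reuse_frac))
    n_fresh = min(len(fresh), batch_size - n_hard - n_good)
    return {
        "fresh": rng.sample(fresh, n_fresh),
        "re_rollout": rng.sample(hard, n_hard),
        "reuse": rng.sample(good, n_good),
    }


buffer = [Record("p1", 0.0), Record("p2", 0.75), Record("p3", 1.0), Record("p4", 0.0)]
fresh = [f"q{i}" for i in range(10)]
batch = build_batch(fresh, buffer, batch_size=8)
```

In this toy run the always-solved prompt `p3` is left out of both buffered groups, since identical rewards would carry no learning signal under group-normalized advantages.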

Findings

01

BAPO achieves 12.5% average improvement over GRPO.

02

BAPO resolves 40.7% of previously unsolvable problems.

03

BAPO enhances learning across mathematics, planning, and visual reasoning tasks.

Abstract

Traditional on-policy Reinforcement Learning with Verifiable Rewards (RLVR) frameworks suffer from experience waste and reward homogeneity, which directly hinder learning efficiency on difficult samples during large language model post-training. In this paper, we introduce Batch Adaptation Policy Optimization (BAPO), an off-policy RLVR framework that improves data efficiency in large language model post-training. It dynamically selects training batches by re-evaluating historically difficult samples and reusing high-quality ones, while retaining a lower-bound guarantee for policy improvement. Extensive experiments demonstrate that BAPO achieves an average 12.5% improvement over GRPO across mathematics, planning, and visual reasoning tasks. Crucially, BAPO successfully resolves 40.7% of the problems that base models consistently fail to solve.
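
The "reward homogeneity" problem mentioned above can be illustrated with a GRPO-style group-normalized advantage, where each rollout's reward is centered on the group mean and scaled by the group standard deviation. The sketch below is an assumption-laden illustration rather than the paper's code: the function name `group_advantages`, the `eps` constant, and the use of the population standard deviation are choices made here (implementations differ on the last point).

```python
from statistics import mean, pstdev


def group_advantages(rewards, eps=1e-6):
    """GRPO-style advantages for one prompt's rollouts: (r - mean) / (std + eps)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption, some code uses sample std
    return [(r - mu) / (sigma + eps) for r in rewards]


# A group with mixed outcomes yields non-zero advantages of both signs.
mixed = group_advantages([1.0, 0.0, 0.0, 1.0])

# A group where every rollout fails yields all-zero advantages,
# so the prompt contributes no gradient signal in that step.
all_fail = group_advantages([0.0, 0.0, 0.0, 0.0])
```

When a hard prompt keeps producing all-failing groups, its rollouts are effectively wasted under on-policy training, which is the motivation the abstract gives for revisiting such prompts later.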

Peer Reviews

Decision·ICLR 2026 Poster

Reviewer 01 · Rating 4 · Confidence 4

Strengths

- The ablation study is comprehensive, including ablations on $\mathcal{X}_2, \mathcal{X}_3$, delay steps (v), re-rollout frequency (m), and difficulty thresholds $c_1,c_2,c_3$.
- Experiments span multiple reasoning domains, demonstrating versatility.
- The paper structure is clear.

Weaknesses

- Overall, the method appears to be a collection of empirical tricks rather than a broadly generalizable or methodologically novel approach. Specifically, the method introduces a large number of hyperparameters and custom design choices, such as $c_1, c_2, c_3$, the online sample mean ($\mu$) and standard deviation ($\sigma$), the re-rollout frequency (m), the linear mapping function, the rollout delay steps (v), and the proportion of different sample types, which greatly limit its plug-and-play usability.
- Th

Reviewer 02 · Rating 6 · Confidence 4

Strengths

- The paper clearly identifies the key limitations of existing on-policy RLVR methods (experience waste and reward homogeneity) and proposes a well-motivated off-policy framework with intuitive design principles that align with RL fundamentals.
- The paper provides rigorous theoretical analysis (Theorem 3.2) proving that the adaptive batch construction mechanism maintains a lower bound guarantee for policy improvement, ensuring training stability while leveraging off-policy data.
- The evaluatio

Weaknesses

- While BAPO effectively improves learning efficiency through adaptive sample selection, it primarily reorganizes existing experiences rather than fundamentally expanding the model's exploration space. This may limit its ability to solve problems that consistently fail under the current policy distribution.
- The paper primarily compares rollout counts, but lacks detailed analysis of the actual computational overhead, particularly the costs of forward passes (e.g., computing log probabilities fo

Reviewer 03 · Rating 4 · Confidence 4

Strengths

1. This paper proposes a method that systematically integrates off-policy reinforcement learning into reinforcement learning with verifiable rewards (RLVR) for large language models. By introducing experience replay and importance sampling mechanisms, the method establishes a learning framework that differs from the traditional on-policy training paradigm.
2. The core innovation of the method lies in the design of a dynamic re-evaluation mechanism, which can reassess historical data acc

Weaknesses

1. **On the necessity of the Gaussian sampling in the $\mathcal{X}_1$ module.** The paper does not validate the performance impact of removing Gaussian sampling through ablation experiments, nor does it compare other sampling strategies (such as uniform sampling or difficulty-based sampling). Readers cannot determine whether this module is indispensable or merely a redundant design that increases methodological complexity. Therefore, it is recommended to supplement ablation experiments on Gaussi

Taxonomy

Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Topic Modeling