LLMs Can Learn to Reason Via Off-Policy RL
Daniel Ritter, Owen Oertell, Bradley Guo, Jonathan Chang, Kianté Brantley, Wen Sun

TL;DR
This paper introduces OAPL, an off-policy RL algorithm for LLM post-training that tolerates large policy lag. It outperforms GRPO with importance sampling on competition math benchmarks and matches DeepCoder on LiveCodeBench with fewer generations.
Contribution
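The paper proposes OAPL (Optimal Advantage-based Policy Optimization with Lagged Inference policy), an off-policy RL algorithm that requires neither importance sampling nor inference-engine modifications to align the training and inference policies, and that remains effective under large policy lag.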
Findings
OAPL outperforms GRPO with importance sampling on math benchmarks.
OAPL matches DeepCoder's performance on LiveCodeBench with fewer generations.
Models trained with OAPL show improved test time scaling under Pass@k.
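For readers unfamiliar with the Pass@k metric mentioned in the last finding, the sketch below shows the commonly used unbiased estimator, 1 - C(n-c, k) / C(n, k), computed from n sampled generations of which c are correct. This is an assumption about how Pass@k is typically computed, not a statement of the paper's exact evaluation protocol; the function name `pass_at_k` is hypothetical.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate Pass@k from n sampled generations, c of which are correct.

    Uses the standard unbiased estimator 1 - C(n - c, k) / C(n, k).
    Assumption: this is a generic illustration, not the paper's own code.
    """
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    # Fewer than k incorrect samples: every size-k subset contains a correct one.
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


if __name__ == "__main__":
    # Hypothetical counts: 10 generations for one problem, 3 of them correct.
    for k in (1, 5, 10):
        print(k, pass_at_k(n=10, c=3, k=k))
```

Improved test-time scaling under Pass@k means this estimate keeps rising as k grows, rather than saturating early.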
Abstract
Reinforcement learning (RL) approaches for Large Language Models (LLMs) frequently use on-policy algorithms, such as PPO or GRPO. However, policy lag from distributed training architectures and differences between the training and inference policies break this assumption, making the data off-policy by design. To rectify this, prior work has focused on making this off-policy data appear more on-policy, either via importance sampling (IS), or by more closely aligning the training and inference policies by explicitly modifying the inference engine. In this work, we embrace off-policyness and propose a novel off-policy RL algorithm that does not require these modifications: Optimal Advantage-based Policy Optimization with Lagged Inference policy (OAPL). We show that OAPL outperforms GRPO with importance sampling on competition math benchmarks, and can match the performance of a publicly…
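To make the importance-sampling correction mentioned in the abstract concrete, here is a minimal sketch of the two ingredients it refers to: GRPO-style group-relative advantages and a truncated per-token importance weight between the training policy and the lagged inference policy. It illustrates the prior approach that OAPL is compared against, not OAPL itself. The function names, the `cap` value, and the toy numbers are all hypothetical.

```python
import math
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one prompt's group of sampled responses."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def truncated_is_weights(logp_train, logp_infer, cap=2.0):
    """Per-token ratio pi_train / pi_infer, truncated at `cap`.

    logp_train: token log-probs under the current training policy.
    logp_infer: token log-probs under the (lagged) inference policy
                that actually generated the tokens.
    Assumption: `cap` is an illustrative value, not one taken from the paper.
    """
    return [min(math.exp(lt - li), cap) for lt, li in zip(logp_train, logp_infer)]


if __name__ == "__main__":
    # Toy group of four responses to one prompt with binary rewards.
    print(group_relative_advantages([1.0, 0.0, 1.0, 0.0]))
    # Toy two-token response: the second token is far more likely under
    # the training policy, so its weight hits the cap.
    print(truncated_is_weights([-1.0, -0.5], [-1.0, -2.0]))
```

The larger the lag between the two policies, the further these ratios drift from 1 and the more the truncation discards signal, which is the difficulty the abstract describes.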
Taxonomy
Topics: Topic Modeling · Machine Learning and Data Classification · Machine Learning in Healthcare
