RePO: Replay-Enhanced Policy Optimization
Siheng Li, Zhanhui Zhou, Wai Lam, Chao Yang, Chaochao Lu

TL;DR
RePO introduces a replay-enhanced method for policy optimization in reinforcement learning applied to large language models, significantly improving performance on mathematical reasoning tasks by utilizing diverse off-policy samples.
Contribution
This paper presents RePO, a novel replay-based policy optimization technique that enhances data efficiency and performance in RL for large language models, surpassing previous on-policy methods.
Findings
RePO achieves absolute average gains of 18.4 and 4.1 points over GRPO on Qwen2.5-Math-1.5B and Qwen3-1.7B, respectively.
RePO increases computational cost by 15%.
Effective optimization steps increase by 48%.
Abstract
Reinforcement learning (RL) is vital for optimizing large language models (LLMs). The recent Group Relative Policy Optimization (GRPO) method estimates advantages using multiple on-policy outputs per prompt, leading to high computational costs and low data efficiency. To address this, we introduce Replay-Enhanced Policy Optimization (RePO), which leverages diverse replay strategies to retrieve off-policy samples from a replay buffer, allowing policy optimization based on a broader and more diverse set of samples for each prompt. Experiments on five LLMs across seven mathematical reasoning benchmarks demonstrate that RePO achieves absolute average performance gains of 18.4 and 4.1 points for Qwen2.5-Math-1.5B and Qwen3-1.7B, respectively, compared to GRPO. Further analysis indicates that RePO increases computational cost by 15% while raising the number of effective optimization steps by…
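The idea in the abstract, scoring each prompt's outputs relative to a group that mixes fresh on-policy samples with samples retrieved from a replay buffer, can be illustrated with a minimal Python sketch. Everything here is an assumption made for illustration rather than the authors' implementation: the names `ReplayBuffer`, `build_group`, and `group_relative_advantages` are hypothetical, uniform random retrieval stands in for the paper's replay strategies (which this summary does not describe), and the sketch stops at advantage estimation without any policy-gradient loss or off-policy correction.

```python
import random
from collections import defaultdict, deque
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """GRPO-style advantages: standardize rewards within one prompt's group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


class ReplayBuffer:
    """Hypothetical per-prompt store of (output, reward) pairs from earlier policies."""

    def __init__(self, capacity_per_prompt=64):
        # Oldest samples are evicted once a prompt's deque is full.
        self._store = defaultdict(lambda: deque(maxlen=capacity_per_prompt))

    def add(self, prompt, samples):
        self._store[prompt].extend(samples)

    def retrieve(self, prompt, k, rng=random):
        """Uniform retrieval; an assumed stand-in for the paper's replay strategies."""
        pool = list(self._store[prompt])
        return rng.sample(pool, min(k, len(pool)))


def build_group(prompt, on_policy, buffer, num_replay):
    """Mix on-policy samples with replayed ones and score the combined group."""
    replayed = buffer.retrieve(prompt, num_replay)
    group = list(on_policy) + replayed
    advantages = group_relative_advantages([reward for _, reward in group])
    # Fresh samples become replay candidates for later steps.
    buffer.add(prompt, on_policy)
    return group, advantages
```

In this sketch, a group whose on-policy rewards are all equal yields zero advantages on its own, whereas adding a replayed sample with a different reward produces nonzero advantages for every member of the group.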
Taxonomy
Topics: Topic Modeling · Reinforcement Learning in Robotics · Multimodal Machine Learning Applications
Methods: Sparse Evolutionary Training
