Efficient RL Training for LLMs with Experience Replay

Charles Arnal; Vivien Cabannes; Taco Cohen; Julia Kempe; Remi Munos

arXiv:2604.08706·cs.LG·April 13, 2026

TL;DR

This paper investigates the use of experience replay buffers in large language model post-training, demonstrating that well-designed replay strategies can reduce inference costs without sacrificing performance.

Contribution

It challenges the belief that on-policy data is necessary for LLM training, formalizes replay buffer design trade-offs, and empirically shows efficiency gains.

Findings

01 Replay buffers can drastically reduce inference compute.

02 Well-designed replay buffers can maintain or improve model performance.

03 Strict on-policy sampling is suboptimal when generation is expensive.

Abstract

While Experience Replay, the practice of storing rollouts and reusing them multiple times during training, is a foundational technique in general RL, it remains largely unexplored in LLM post-training due to the prevailing belief that fresh, on-policy data is essential for high performance. In this work, we challenge this assumption. We present a systematic study of replay buffers for LLM post-training, formalizing the optimal design as a trade-off between staleness-induced variance, sample diversity, and the high computational cost of generation. We show that strict on-policy sampling is suboptimal when generation is expensive. Empirically, we show that a well-designed replay buffer can drastically reduce inference compute without degrading final model performance, and in some cases even improves it, while preserving policy entropy.
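To make the idea of storing rollouts and reusing them concrete, here is a minimal Python sketch of a replay buffer with a staleness cutoff. It is not the authors' implementation: the names `Rollout`, `ReplayBuffer`, `max_staleness`, `refresh_every`, and `run` are hypothetical, the rollouts are placeholders rather than model generations, and the training update is left as a comment. It only illustrates how generating less often while sampling from a buffer lowers the number of generation calls.

```python
import random
from collections import deque
from dataclasses import dataclass


@dataclass
class Rollout:
    prompt: str
    completion: str
    reward: float
    policy_version: int  # training step at which the rollout was generated


class ReplayBuffer:
    """FIFO buffer that drops rollouts older than `max_staleness` policy updates."""

    def __init__(self, capacity: int, max_staleness: int, seed: int = 0):
        self.buffer = deque(maxlen=capacity)  # oldest entries fall out when full
        self.max_staleness = max_staleness
        self.rng = random.Random(seed)

    def add(self, rollouts):
        self.buffer.extend(rollouts)

    def evict_stale(self, current_version: int):
        # Keep only rollouts generated within the last `max_staleness` updates.
        self.buffer = deque(
            (r for r in self.buffer
             if current_version - r.policy_version <= self.max_staleness),
            maxlen=self.buffer.maxlen,
        )

    def sample(self, batch_size: int, current_version: int):
        self.evict_stale(current_version)
        k = min(batch_size, len(self.buffer))
        return self.rng.sample(list(self.buffer), k)


def run(num_steps: int, refresh_every: int,
        rollouts_per_refresh: int = 8, batch_size: int = 4) -> int:
    """Toy loop: returns how many rollouts had to be generated in total."""
    buf = ReplayBuffer(capacity=32, max_staleness=4)
    generation_calls = 0
    for step in range(num_steps):
        if step % refresh_every == 0:
            # Placeholder for the expensive generation phase.
            buf.add(
                Rollout(f"p{i}", f"c{step}-{i}", reward=0.0, policy_version=step)
                for i in range(rollouts_per_refresh)
            )
            generation_calls += rollouts_per_refresh
        batch = buf.sample(batch_size, current_version=step)
        # A real trainer would apply an off-policy-aware update on `batch` here.
    return generation_calls


on_policy_cost = run(num_steps=8, refresh_every=1)  # fresh rollouts every step
replay_cost = run(num_steps=8, refresh_every=4)     # reuse buffered rollouts
```

In this toy setup, refreshing every fourth step generates a quarter as many rollouts as refreshing every step. The parameters `capacity`, `max_staleness`, and `refresh_every` are the knobs where the trade-off named in the abstract would show up: older samples are cheaper to reuse but further from the current policy.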
