Generate, Filter, Control, Replay: A Comprehensive Survey of Rollout Strategies for LLM Reinforcement Learning

Rohan Surana; Gagan Mundada; Xunyi Jiang; Chuhan Wang; Zhenwei Tang; Difan Jiao; Zihan Huang; Yuxin Xiong; Junda Wu; Sheldon Yu; Xintong Li; Raghav Jain; Nikki Kuang; Sizhe Zhou; Bowen Jin; Zhendong Chu; Tong Yu; Ryan Rossi; Kuan-Hao Huang; Jingbo Shang; Jiawei Han; Julian McAuley

arXiv:2605.02913 · cs.LG · May 6, 2026

TL;DR

This survey systematically analyzes rollout strategies for reinforcement learning in large language models, introducing a modular framework and taxonomy to improve understanding and design of these pipelines.

Contribution

It formalizes rollout pipelines with a unified notation, proposes the GFCR lifecycle taxonomy, and synthesizes diverse methods and case studies for RL-based reasoning LLMs.

Findings

01 Introduces GFCR, a modular taxonomy for rollout strategies.

02 Synthesizes methods across various RL approaches for LLMs.

03 Provides diagnostic tools and open challenges for rollout pipeline improvement.

Abstract

Reinforcement learning (RL) has become a central post-training tool for improving the reasoning abilities of large language models (LLMs). In these systems, the rollout (the trajectory sampled from a prompt to termination, including intermediate reasoning steps and optional tool or environment interactions) determines the data the optimizer learns from, yet rollout design is often underreported. This survey provides an optimizer-agnostic view of rollout strategies for RL-based post-training of reasoning LLMs. We formalize rollout pipelines with unified notation and introduce Generate-Filter-Control-Replay (GFCR), a lifecycle taxonomy that decomposes rollout pipelines into four modular stages: Generate proposes candidate trajectories and topologies; Filter constructs intermediate signals via verifiers, judges, and critics; Control allocates compute and makes continuation/branching/stopping…
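To make the stage decomposition concrete, the following is a minimal Python sketch of the first three stages as the abstract describes them. It is an illustration under stated assumptions, not the survey's formalism: the names `Trajectory`, `generate`, `filter_stage`, and `rollout`, and the parameters `batch_size`, `max_samples`, `target_passes`, and `pass_score`, are all hypothetical. The `sampler` and `verifier` callables stand in for an LLM policy and a verifier. Replay is left out because its description is cut off in the abstract text above.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Trajectory:
    prompt: str
    response: str
    score: float = 0.0  # filled in by the Filter stage


def generate(prompt: str, sampler: Callable[[str], str], n: int) -> List[Trajectory]:
    """Generate: propose n candidate trajectories for one prompt."""
    return [Trajectory(prompt, sampler(prompt)) for _ in range(n)]


def filter_stage(
    trajs: List[Trajectory], verifier: Callable[[str, str], float]
) -> List[Trajectory]:
    """Filter: attach an intermediate signal (here, a verifier score)."""
    for t in trajs:
        t.score = verifier(t.prompt, t.response)
    return trajs


def rollout(
    prompt: str,
    sampler: Callable[[str], str],
    verifier: Callable[[str, str], float],
    batch_size: int = 4,
    max_samples: int = 16,
    target_passes: int = 2,
    pass_score: float = 1.0,
) -> List[Trajectory]:
    """Control (assumed policy): sample in batches and stop early once
    enough trajectories pass, or once the sample budget is spent."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    collected: List[Trajectory] = []
    while len(collected) < max_samples:
        n = min(batch_size, max_samples - len(collected))
        collected.extend(filter_stage(generate(prompt, sampler, n), verifier))
        passes = sum(t.score >= pass_score for t in collected)
        if passes >= target_passes:
            break  # stopping decision: the target number of passes is met
    return collected
```

In this sketch the early-stopping rule is only one possible Control policy. A real pipeline might instead branch from partial trajectories or reallocate the budget across prompts. Those choices are outside what the abstract text specifies.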
