Selective Rollout: Mid-Trajectory Termination for Multi-Sample Agent RL

Zhiyuan Zhai; Xin Wang

arXiv:2605.05802·cs.LG·May 8, 2026

TL;DR

This paper introduces a method that terminates rollout groups mid-trajectory when they are predicted to end with zero reward variance in multi-sample RL training, reducing wasted computation and improving both training efficiency and success rate.

Contribution

It proposes a gating mechanism based on prefix divergence that predicts zero-variance groups during training, enabling early termination and compute savings.
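As a rough illustration of what a prefix-divergence gate could look like, the sketch below scores how much the partial trajectories in one group differ and flags the group for termination when they barely differ. This summary does not give the paper's actual divergence measure or threshold, so everything here is an assumption: the names `prefix_divergence` and `should_terminate`, the use of mean pairwise mismatch over action prefixes as the divergence, and the `threshold` value are all hypothetical.

```python
from itertools import combinations


def prefix_divergence(prefixes):
    """Mean pairwise fraction of positions at which two action prefixes differ.

    ASSUMPTION: this mismatch rate is a stand-in divergence measure, not the
    one defined in the paper.
    """
    pairs = list(combinations(prefixes, 2))
    if not pairs:
        return 0.0
    total = 0.0
    for a, b in pairs:
        longest = max(len(a), len(b))
        if longest == 0:
            continue  # two empty prefixes are identical
        # Count differing positions; a length gap counts as mismatches too.
        mismatches = sum(1 for x, y in zip(a, b) if x != y) + abs(len(a) - len(b))
        total += mismatches / longest
    return total / len(pairs)


def should_terminate(prefixes, threshold=0.1):
    """Flag a group as a likely zero-variance group (hypothetical rule)."""
    return prefix_divergence(prefixes) < threshold


# Rollouts that took identical actions so far versus rollouts that split.
identical = [["search", "click"], ["search", "click"], ["search", "click"]]
split = [["search", "click"], ["search", "back"]]
```

In this toy setting `identical` would be gated out and `split` would continue, since rollouts that have already diverged are more likely to end with different rewards.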

Findings

01  10.7% faster training in wall-clock time

02  +2.5 percentage points in success rate on unseen tasks

03  Less dilution of gradient batches by zero-advantage samples

Abstract

Group-relative RL training (GRPO) samples a small group of parallel rollouts for every training prompt and uses their within-group reward spread to compute per-trajectory advantages. In agentic environments each rollout is a long multi-turn dialogue with one LLM call per step, so this multi-sample multiplier dominates the total training cost. When every rollout of a prompt ends with the same reward, the group has zero reward variance and contributes no gradient, so the extra rollouts add no information; such groups are common in practice (typically around 40% of all groups), so the wasted-compute fraction is substantial rather than marginal. Existing methods filter such groups at the prompt level, either after their rollouts are paid for or before any rollout begins, but both decide without using information that becomes available during the rollout itself. We instead ask whether the…
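The zero-variance case described in the abstract can be made concrete with a small sketch of group-relative advantages. It assumes the commonly used form that standardizes each reward by the group mean and standard deviation; the function name `group_relative_advantages` and the `eps` stabilizer are illustrative choices, not taken from the paper.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Standardize rewards within one group of rollouts for the same prompt.

    ASSUMPTION: mean/std normalization is used here as the group-relative
    advantage; the paper's exact formula is not shown in this summary.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std over the group
    return [(r - mu) / (sigma + eps) for r in rewards]


# A group with mixed outcomes yields nonzero advantages.
mixed = group_relative_advantages([1.0, 0.0, 0.0, 1.0])

# A group where every rollout gets the same reward yields all-zero
# advantages, so it contributes nothing to the policy gradient.
flat = group_relative_advantages([1.0, 1.0, 1.0, 1.0])
```

Every rollout in `flat` was paid for in full, yet its advantages are all zero, which is the wasted compute the abstract refers to.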

Code & Models

Repositories

zhiyuanZhai20/selective-rollout
github