Rollout Pass-Rate Control: Steering Binary-Reward RL Toward Its Most Informative Regime
Tianshu Zhu, Wenyu Zhang, Xiaoying Zuo, Lun Tian, Haotian Zhao, Yucheng Zeng, Jingnan Gu, Daxiang Dong, Jianmin Wu, Dawei Yin, and Dou Shen

TL;DR
This paper introduces Prefix Sampling, a method that replays self-generated trajectory prefixes to steer binary-reward RL rollout groups toward a pass rate near 50%, where the reward signal is most informative, improving training efficiency on software engineering tasks.
Contribution
It proposes a novel pass-rate control technique using trajectory prefix replay, enhancing RL training stability and speed in software engineering benchmarks.
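As a rough illustration of the control idea, the sketch below picks which kind of prefix to replay for a group based on its observed pass rate, then builds a loss mask that excludes the replayed tokens. The function names, the `low`/`high` thresholds, and the list-based mask are assumptions made for this example; the paper's actual selection rule and prefix-length policy are not given in this summary.

```python
from typing import List, Optional


def choose_prefix_kind(pass_rate: float,
                       low: float = 0.25,
                       high: float = 0.75) -> Optional[str]:
    """Decide which self-generated prefix to replay for a rollout group.

    `low` and `high` are hypothetical thresholds around the 50% target.
    """
    if pass_rate < low:
        # Mostly failing group: a successful prefix gives a head start.
        return "success"
    if pass_rate > high:
        # Mostly passing group: a failing prefix acts as a handicap.
        return "failure"
    # Already near the informative regime: no replay needed.
    return None


def build_loss_mask(prefix_len: int, total_len: int) -> List[int]:
    """Return 0 for replayed prefix tokens and 1 for current-policy tokens."""
    if not 0 <= prefix_len <= total_len:
        raise ValueError("prefix_len must be between 0 and total_len")
    return [0] * prefix_len + [1] * (total_len - prefix_len)


if __name__ == "__main__":
    print(choose_prefix_kind(0.125))   # mostly failing group
    print(choose_prefix_kind(0.5))     # balanced group
    print(build_loss_mask(prefix_len=3, total_len=6))
```

Masking the replayed tokens means only tokens sampled from the current policy contribute to the gradient, which is the property the abstract describes.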
Findings
Prefix Sampling achieves high-score regimes within evaluation variability.
It delivers 2.01x and 1.55x speedups on large language models.
Ablation studies confirm replay, coverage, and control as key factors.
Abstract
Agentic reinforcement learning (RL) for software engineering spends much of its compute on stateful trajectories whose grouped binary rewards are highly skewed and weakly contrastive. We frame this as pass-rate control and show that the binary reward-side signal is strongest near a 50% rollout pass rate under four criteria: reward entropy, group-filtering survival, leave-one-out (RLOO) advantage energy under Group Relative Policy Optimization (GRPO), and success-failure pair count. We propose Prefix Sampling (PS), which replays self-generated trajectory prefixes to steer skewed groups toward this regime: successful prefixes give mostly failing groups a head start, while failing prefixes handicap mostly passing groups. Replayed states are reconstructed through the existing rollout path, and replayed tokens are masked from the loss so optimization applies only to current-policy…
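The claim that all four criteria peak near a 50% pass rate can be checked numerically. The sketch below uses textbook definitions that are assumptions of this example rather than the paper's exact formulas: Bernoulli entropy for the reward, the probability that an i.i.d. group is not all-pass or all-fail for filtering survival, the sum of squared leave-one-out advantages for advantage energy, and the number of (success, failure) pairs within a group.

```python
import math
from typing import List


def reward_entropy(p: float) -> float:
    """Entropy (in nats) of a binary reward with success probability p."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -(p * math.log(p) + (1.0 - p) * math.log(1.0 - p))


def survival_probability(p: float, group_size: int) -> float:
    """Probability that a group of i.i.d. rollouts has mixed outcomes,
    so it is not discarded by a filter that drops uniform groups."""
    return 1.0 - p ** group_size - (1.0 - p) ** group_size


def rloo_advantages(rewards: List[int]) -> List[float]:
    """Leave-one-out advantage: reward minus the mean of the other rewards."""
    n = len(rewards)
    total = sum(rewards)
    return [r - (total - r) / (n - 1) for r in rewards]


def rloo_energy(rewards: List[int]) -> float:
    """Sum of squared leave-one-out advantages over the group."""
    return sum(a * a for a in rloo_advantages(rewards))


def pair_count(rewards: List[int]) -> int:
    """Number of (success, failure) pairs in the group."""
    k = sum(rewards)
    return k * (len(rewards) - k)


if __name__ == "__main__":
    G = 8  # hypothetical group size
    for k in range(G + 1):
        rewards = [1] * k + [0] * (G - k)
        p = k / G
        print(f"k={k} p={p:.3f} "
              f"H={reward_entropy(p):.3f} "
              f"survive={survival_probability(p, G):.3f} "
              f"energy={rloo_energy(rewards):.3f} "
              f"pairs={pair_count(rewards)}")
```

Under these definitions, every column is zero when the group is uniformly failing or passing and is largest when half of the rollouts succeed, which is the regime the method steers toward.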
