Where to Spend Rollouts: Hit-Utility Optimal Rollout Allocation for Group-Based RLVR
Tao Wang, Shuo Li, Yan Sun, Dongsheng Ding, Edgar Dobriban

TL;DR
This paper introduces HORA, a learning-free method for allocating rollouts in group-based RLVR. Instead of giving every prompt the same fixed number of rollouts, HORA reallocates them adaptively, improving efficiency and performance over fixed allocation on multiple reasoning benchmarks.
Contribution
HORA allocates rollouts by maximizing total posterior hit utility within each allocation batch. This gives an adaptive alternative to uniform allocation that outperforms it on multiple reasoning benchmarks.
Findings
HORA improves Pass@K over GRPO in most configurations.
HORA is compatible with other group-based estimators, such as RLOO.
A uniform prior in HORA is competitive with learned priors.
Abstract
Reinforcement learning with verifiable rewards (RLVR) has emerged as a central paradigm for improving the reasoning capabilities of large language models. Group-based policy optimization methods, such as GRPO, typically allocate a fixed number of rollouts to every prompt. This uniform allocation can be inefficient: it over-allocates compute to prompts whose sampled groups are already saturated while under-exploring prompts for which additional samples may reveal useful correct trajectories. To address this limitation, we introduce hit utility, the posterior probability that at least one rollout in a proposed additional allocation for a prompt will be correct. Building on this notion, we propose Hit-Utility Optimal Rollout Allocation (HORA), a learning-free rollout allocation policy that maximizes total posterior hit utility within each allocation batch. HORA adaptively reallocates…
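The hit-utility idea in the abstract can be illustrated with a small sketch. It is not the authors' implementation: it assumes a Beta(a, b) prior on each prompt's per-rollout success probability (a = b = 1 is the uniform prior), binary verifier outcomes, and a simple greedy allocator. The names hit_utility and allocate are hypothetical. Under these assumptions, the posterior probability that at least one of n further rollouts is correct has a closed form, and its marginal gains shrink as n grows, so repeatedly giving the next rollout to the prompt with the largest marginal gain maximizes the summed hit utility for a fixed budget.

```python
import heapq


def hit_utility(successes, failures, n, a=1.0, b=1.0):
    """Posterior P(at least one of n further rollouts is correct).

    Assumes a Beta(a, b) prior on the prompt's success probability,
    updated with the observed successes and failures (an assumption,
    not taken from the paper).
    """
    alpha = a + successes
    beta = b + failures
    miss = 1.0  # posterior probability that all n further rollouts fail
    for i in range(n):
        miss *= (beta + i) / (alpha + beta + i)
    return 1.0 - miss


def allocate(counts, budget, a=1.0, b=1.0):
    """Greedily split `budget` rollouts across prompts.

    counts: list of (successes, failures) observed so far per prompt.
    Returns a list with the number of additional rollouts per prompt.
    """
    alloc = [0] * len(counts)
    heap = []
    for idx, (s, f) in enumerate(counts):
        gain = hit_utility(s, f, 1, a, b)  # gain of the first extra rollout
        heapq.heappush(heap, (-gain, idx))
    for _ in range(budget):
        _, idx = heapq.heappop(heap)
        alloc[idx] += 1
        s, f = counts[idx]
        n = alloc[idx]
        # Marginal gain of one more rollout on top of the n already assigned.
        gain = hit_utility(s, f, n + 1, a, b) - hit_utility(s, f, n, a, b)
        heapq.heappush(heap, (-gain, idx))
    return alloc


if __name__ == "__main__":
    # Three hypothetical prompts: all correct, all wrong, and mixed so far.
    observed = [(4, 0), (0, 4), (2, 2)]
    print(allocate(observed, budget=6))
```

In this sketch, each extra rollout on a prompt adds less hit utility than the one before it, so the budget spreads across prompts rather than concentrating on any single one. How the paper defines its prior, batches, and tie-breaking is not stated in the text above, so those details here are placeholders.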
