Adaptive Rollout Allocation for Online Reinforcement Learning with Verifiable Rewards
Hieu Trung Nguyen, Bao Nguyen, Wenao Ma, Yuzhi Zhao, Ruifeng She, Viet Anh Nguyen

TL;DR
This paper introduces VIP, a variance-informed adaptive rollout allocation method for online reinforcement learning with verifiable rewards, improving sampling efficiency and training performance.
Contribution
We propose VIP, a novel adaptive allocation strategy using Gaussian process predictions to minimize gradient variance, enhancing efficiency over fixed or heuristic methods.
Findings
VIP outperforms uniform allocation on AIME24/25 and tool-augmented retrieval benchmarks under the same total rollout budget.
VIP achieves higher policy performance with fewer rollouts.
The method effectively reduces gradient variance during training.
Abstract
Sampling efficiency is a key bottleneck in reinforcement learning with verifiable rewards. Existing group-based policy optimization methods, such as GRPO, allocate a fixed number of rollouts for all training prompts. This uniform allocation implicitly treats all prompts as equally informative, and could lead to inefficient computational budget usage and impede training progress. We introduce VIP, a Variance-Informed Predictive allocation strategy that allocates a given rollout budget to the prompts in the incumbent batch to minimize the expected gradient variance of the policy update. At each iteration, VIP uses a lightweight Gaussian process model to predict per-prompt success probabilities based on recent rollouts. These probability predictions are translated into variance estimates, which are then fed into a convex optimization problem to determine the optimal rollout allocations…
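The allocation step described in the abstract can be illustrated with a minimal sketch. Several things here are assumptions rather than the paper's formulation: the predicted success probabilities `p_hat` are taken as given (the Gaussian process predictor is not modeled), the per-prompt variance is approximated by the Bernoulli reward variance p(1 - p) instead of the paper's gradient-variance formulas for Dr.GRPO and RLOO, and the objective sum_i v_i / n_i with a per-prompt floor `n_min` is a stand-in for the paper's convex program. The function name `allocate_rollouts` and its parameters are hypothetical.

```python
import numpy as np

def allocate_rollouts(p_hat, budget, n_min=2):
    """Split `budget` rollouts across prompts to minimize sum_i v_i / n_i.

    v_i = p_i * (1 - p_i) is the Bernoulli reward variance, used here as an
    assumed proxy for the per-prompt gradient-variance term.
    """
    p = np.clip(np.asarray(p_hat, dtype=float), 0.0, 1.0)
    k = p.size
    if budget < n_min * k:
        raise ValueError("budget too small for n_min rollouts per prompt")
    s = np.sqrt(p * (1.0 - p))           # per-prompt reward standard deviation

    # Continuous relaxation: n_i proportional to s_i, with a floor of n_min.
    n = np.full(k, float(n_min))
    free = np.ones(k, dtype=bool)        # prompts not pinned to the floor
    while free.any():
        remaining = budget - n_min * int((~free).sum())
        total = s[free].sum()
        if total <= 0.0:                 # remaining prompts look deterministic
            n[free] = remaining / int(free.sum())
            break
        cand = remaining * s / total
        pinned = free & (cand < n_min - 1e-9)
        if not pinned.any():
            n[free] = cand[free]
            break
        free &= ~pinned                  # pin at n_min and re-solve for the rest

    # Round with the largest-remainder rule so the counts sum to the budget.
    counts = np.floor(n + 1e-9).astype(int)
    leftover = int(budget) - int(counts.sum())
    order = np.argsort(-(n - counts), kind="stable")
    counts[order[:leftover]] += 1
    return counts
```

Under these assumptions, `allocate_rollouts([0.5, 0.5, 0.0, 1.0], budget=16)` returns `[6, 6, 2, 2]`: prompts whose outcome is predicted to be nearly certain receive only the floor, and the remaining budget goes to the prompts with the most uncertain outcomes.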
Peer Reviews
Decision: ICLR 2026 Poster
The paper derives per-prompt gradient-variance formulas for Dr.GRPO and RLOO, motivating variance-aware allocation rather than uniform rollouts. On AIME24/25 and tool-augmented retrieval, VIP consistently improves accuracy/quality over uniform allocation while keeping the same total number of rollouts. Replacing either the GP predictor or the optimizer degrades performance, suggesting both components contribute meaningfully.
Results are reported under the same rollout budget; however, adaptive per-prompt rollout counts can change the number of optimizer steps and total time. The paper should add (a) wall-clock time vs. accuracy and (b) equal-step comparisons to isolate compute-efficiency. Training a GP and solving the allocation each iteration adds overhead; the paper does not provide detailed runtime/memory profiling vs. uniform baselines.
- The paper addresses a relevant and timely problem in reinforcement learning for language models, focusing on efficient rollout allocation under limited budgets.
- The proposed VIP framework is clearly motivated and presented as a practical enhancement to existing RL with verifiable rewards (RLVR) methods.
- The topic is well aligned with current interest in improving training efficiency for large models.
- The proposed idea of allocating samples based on variance estimates, where the unknown d
The theoretical analysis is developed under very restrictive assumptions. It sometimes feels as though several of the key challenges of the original setting have been simplified away in order to make the variance computations tractable. While this makes the analysis cleaner, it also raises concerns about the realism and relevance of the resulting conclusions and the potential for counterintuitive sample allocations. Concern about Assumption 3.1: The assumption $\pi_{old} = \pi_\theta$ effective
- The motivation for the rollout budget cost and performance tradeoff is well-explained.
- The proposed method can be integrated with off-the-shelf learning algorithms, and the experiment results show an improvement in the result.
- The parameter update for adaptive allocation can be done in an online learning approach using Bayesian updates, and does not seem to require huge computation when new data is added.
- One concern I have is whether the Bayesian estimation results in biased estimation of the success probability. If I understand correctly, the probability $p$ measures the probability that the outcome sampled on some query results in a positive reward. This probability should depend on the policy's sentence generation probability, and given that the policy is updated during the training phase, the algorithm should be able to handle the non-stationarity of the reward. However, this discussion is
Taxonomy
Topics: Reinforcement Learning in Robotics · Advanced Bandit Algorithms Research · Stochastic Gradient Optimization Techniques
