Contextual Rollout Bandits for Reinforcement Learning with Verifiable Rewards

Xiaodong Lu; Xiaohan Wang; Jiajun Chai; Guojun Yin; Wei Lin; Zhijun Chen; Yu Luo; Fuzhen Zhuang; Yikun Ban; Deqing Wang

arXiv:2602.08499·cs.LG·February 10, 2026

TL;DR

This paper introduces a neural scheduling framework for reinforcement learning with verifiable rewards, treating rollout selection as a contextual bandit problem to improve efficiency and performance in language model training.

Contribution

The paper formulates rollout scheduling as a contextual bandit problem and proposes a neural framework that adaptively selects rollouts, improving the effectiveness of RLVR training.

Findings

01. Consistent performance improvements across six reasoning benchmarks.
02. Enhanced training efficiency with adaptive rollout reuse.
03. Theoretical guarantees with sublinear regret bounds.

Abstract

Reinforcement Learning with Verifiable Rewards (RLVR) is an effective paradigm for improving the reasoning capabilities of large language models. However, existing RLVR methods utilize rollouts in an indiscriminate and short-horizon manner: responses of heterogeneous quality within each prompt are treated uniformly, and historical rollouts are discarded after a single use. This leads to noisy supervision, poor sample efficiency, and suboptimal policy updates. We address these issues by formulating rollout scheduling in RLVR as a contextual bandit problem and proposing a unified neural scheduling framework that adaptively selects high-value rollouts throughout training. Each rollout is treated as an arm whose reward is defined by the induced performance gain between consecutive optimization steps. The resulting scheduler supports both noise-aware intra-group selection and adaptive global…
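The bandit view described in the abstract can be illustrated with a small sketch. This is not the authors' scheduler: the paper proposes a neural framework, while the code below swaps in a linear UCB (LinUCB-style) model to keep the example short. The class name `LinUCBRolloutScheduler`, the feature vectors, the `alpha` and `ridge` parameters, and the choice to credit every selected rollout with the same observed gain are all assumptions made for illustration.

```python
import numpy as np


class LinUCBRolloutScheduler:
    """Linear UCB stand-in for a neural rollout scheduler (illustrative only)."""

    def __init__(self, dim, alpha=1.0, ridge=1.0):
        self.alpha = alpha                # exploration strength (assumed hyperparameter)
        self.A = ridge * np.eye(dim)      # regularized Gram matrix of selected contexts
        self.b = np.zeros(dim)            # gain-weighted sum of selected contexts

    def scores(self, contexts):
        """Upper confidence score for each rollout (one row of `contexts` per arm)."""
        A_inv = np.linalg.inv(self.A)
        theta = A_inv @ self.b            # ridge-regression estimate of rollout value
        mean = contexts @ theta
        # Per-row quadratic form x^T A^{-1} x, clipped to avoid tiny negative values.
        quad = np.einsum("ij,jk,ik->i", contexts, A_inv, contexts)
        return mean + self.alpha * np.sqrt(np.maximum(quad, 0.0))

    def select(self, contexts, k):
        """Return the indices of the k highest-scoring rollouts."""
        return np.argsort(-self.scores(contexts))[:k]

    def update(self, contexts, chosen, gain):
        """Credit each selected rollout with the observed gain (a simplifying assumption)."""
        for x in contexts[chosen]:
            self.A += np.outer(x, x)
            self.b += gain * x


# Usage with synthetic data: 8 rollouts, 4 hypothetical features per rollout.
rng = np.random.default_rng(0)
scheduler = LinUCBRolloutScheduler(dim=4)
contexts = rng.normal(size=(8, 4))
chosen = scheduler.select(contexts, k=3)
gain = 0.05  # placeholder for the measured gain between two optimization steps
scheduler.update(contexts, chosen, gain)
```

In a real training loop, the context features would presumably describe each rollout (for example its reward or length), and `gain` would come from comparing policy performance before and after the update on the selected rollouts. Both are left as placeholders here because the truncated abstract does not specify them.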

Taxonomy

Topics: Advanced Bandit Algorithms Research · Reinforcement Learning in Robotics · Topic Modeling