DUET: Optimize Token-Budget Allocation for Reinforcement Learning with Verifiable Rewards
Haoyu Hu, Xuandong Zhao, Xuhai "Orson" Xu, Nori Jacoby

TL;DR
DUET jointly optimizes which prompts receive rollouts and how long each rollout runs in reinforcement learning with verifiable rewards (RLVR), leading to faster training and improved reasoning quality.
Contribution
DUET is a computationally efficient layer over GRPO that adaptively allocates token budgets and outperforms baselines across multiple benchmarks.
Findings
DUET outperforms full-budget GRPO and baselines on Qwen3-1.7B.
Using only 50% of tokens, DUET surpasses all full-budget methods.
DUET achieves a 1.62x speedup and maintains high performance under tight budgets.
Abstract
Reinforcement learning with verifiable rewards (RLVR) generates hundreds of thousands of tokens per training step, with rollout generation dominating the computational cost. The overall token budget can be controlled along two main dimensions: (i) deciding which prompts to allocate rollouts to, and (ii) deciding how long each rollout should be. Prior work has generally controlled only one of these dimensions at a time. We show that jointly tuning both decisions under a shared compute budget improves both reasoning quality and wall-clock training time. We instantiate this view as DUal-controlled tokEn allocaTion (DUET), a computationally efficient layer over GRPO that uses a lightweight pre-rollout surrogate of prompt informativeness to set how many rollouts each prompt receives, and a marker-gated abort rule with importance reweighting to set when to stop…
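As a rough illustration of the two budget dimensions described in the abstract, the sketch below converts a shared token budget into a rollout count using a length cap, then splits those rollouts across prompts. It is not DUET's actual surrogate or abort rule, which this page does not specify. The score p * (1 - p) over an estimated pass rate p, and the names `rollouts_within_budget` and `allocate_rollouts`, are assumptions made purely for illustration.

```python
from typing import List


def rollouts_within_budget(token_budget: int, max_rollout_len: int) -> int:
    """Number of rollouts that fit if each may use up to max_rollout_len tokens."""
    return token_budget // max_rollout_len


def allocate_rollouts(pass_rates: List[float], total_rollouts: int) -> List[int]:
    """Split total_rollouts across prompts in proportion to p * (1 - p).

    ASSUMPTION: p * (1 - p) is a stand-in informativeness score (the variance
    of a binary reward); it is not taken from the paper.
    """
    n = len(pass_rates)
    scores = [p * (1.0 - p) for p in pass_rates]
    total_score = sum(scores)

    if total_score == 0:
        # No prompt looks informative: fall back to a near-uniform split.
        base = total_rollouts // n
        alloc = [base] * n
        for i in range(total_rollouts - base * n):
            alloc[i] += 1
        return alloc

    # Proportional shares, rounded down, then hand out the remainder to the
    # prompts with the largest fractional parts (largest-remainder rounding).
    raw = [total_rollouts * s / total_score for s in scores]
    alloc = [int(r) for r in raw]
    leftover = total_rollouts - sum(alloc)
    order = sorted(range(n), key=lambda i: raw[i] - alloc[i], reverse=True)
    for i in order[:leftover]:
        alloc[i] += 1
    return alloc


if __name__ == "__main__":
    n_rollouts = rollouts_within_budget(token_budget=16_384, max_rollout_len=2_048)
    print(n_rollouts, allocate_rollouts([0.0, 0.5, 1.0], n_rollouts))
```

Under this assumed score, prompts that are always solved or never solved receive no rollouts, and shortening the length cap frees budget for more rollouts. This shows how the two decisions interact under one shared budget.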
