$V_{0.5}$: Generalist Value Model as a Prior for Sparse RL Rollouts
Yi-Kai Zhang, Yueqing Sun, Hongyan Hao, Qi Gu, Xunliang Cai, De-Chuan Zhan, Han-Jia Ye

TL;DR
This paper introduces $V_{0.5}$, a novel value baseline for sparse reinforcement learning that adaptively combines a pre-trained value model with empirical data, improving stability and efficiency in policy gradient estimation.
Contribution
The paper proposes $V_{0.5}$, a new adaptive value baseline that fuses prior knowledge with sparse rollout data, using real-time statistical testing to optimize policy gradient stability.
Findings
$V_{0.5}$ outperforms GRPO and DAPO on six benchmarks.
Achieves faster convergence and over 10% performance gains.
Maintains low variance and stable gradients under extreme sparsity.
Abstract
In Reinforcement Learning with Verifiable Rewards (RLVR), constructing a robust advantage baseline is critical for policy gradients, effectively guiding the policy model to reinforce desired behaviors. Recent research has introduced Generalist Value Models (such as ), which achieve pre-trained value estimation by explicitly encoding model capabilities in-context, eliminating the need to synchronously update the value model alongside the policy model. In this paper, we propose $V_{0.5}$, which adaptively fuses the baseline predicted by such a value model (acting as a prior) with the empirical mean derived from sparse rollouts. This constructs a robust baseline that balances computational efficiency with extremely low variance. Specifically, we introduce real-time statistical testing and dynamic budget allocation. This balances the high variance caused by sparse sampling against the…
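The idea of fusing a prior baseline with an empirical mean can be illustrated with a generic shrinkage estimator. The sketch below is not the paper's formula: the function names, the `prior_strength` pseudo-count, and the fixed weighting are all assumptions made for illustration, and the statistical testing and budget allocation steps mentioned in the abstract are omitted entirely.

```python
from typing import List


def fused_baseline(prior_value: float, rewards: List[float],
                   prior_strength: float = 4.0) -> float:
    """Blend a prior value estimate with the mean of observed rollout rewards.

    Hypothetical illustration: the prior is treated as `prior_strength`
    pseudo-observations, so the result is a weighted average that leans on
    the prior when rollouts are few and on the data when rollouts are many.
    """
    n = len(rewards)
    return (prior_strength * prior_value + sum(rewards)) / (prior_strength + n)


def advantages(prior_value: float, rewards: List[float],
               prior_strength: float = 4.0) -> List[float]:
    """Subtract the fused baseline from each rollout reward."""
    baseline = fused_baseline(prior_value, rewards, prior_strength)
    return [r - baseline for r in rewards]


if __name__ == "__main__":
    # Four sparse rollouts with binary verifiable rewards (made-up values).
    rollout_rewards = [1.0, 0.0, 1.0, 1.0]
    print(fused_baseline(0.5, rollout_rewards))
    print(advantages(0.5, rollout_rewards))
```

With no rollouts at all, the baseline falls back to the prior value; as the number of rollouts grows, the empirical mean dominates. An adaptive scheme would vary the weighting per prompt rather than fixing `prior_strength`.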
Taxonomy
Topics: Reinforcement Learning in Robotics · Explainable Artificial Intelligence (XAI) · Domain Adaptation and Few-Shot Learning
