$V_{0.5}$: Generalist Value Model as a Prior for Sparse RL Rollouts

Yi-Kai Zhang; Yueqing Sun; Hongyan Hao; Qi Gu; Xunliang Cai; De-Chuan Zhan; Han-Jia Ye

arXiv:2603.10848 · cs.LG · March 12, 2026

TL;DR

This paper introduces $V_{0.5}$, a value baseline for reinforcement learning with sparse rollouts that adaptively combines the prediction of a pre-trained value model with the empirical mean of the rollouts, improving stability and efficiency in policy gradient estimation.

Contribution

The paper proposes $V_{0.5}$, an adaptive value baseline that fuses the prediction of a Generalist Value Model (acting as a prior) with the empirical mean of sparse rollouts, using real-time statistical testing and dynamic budget allocation to stabilize the policy gradient.

Findings

01 $V_{0.5}$ outperforms GRPO and DAPO on six benchmarks.

02 It converges faster and achieves performance gains of over 10%.

03 It maintains low variance and stable gradients under extreme sparsity.

Abstract

In Reinforcement Learning with Verifiable Rewards (RLVR), constructing a robust advantage baseline is critical for policy gradients, effectively guiding the policy model to reinforce desired behaviors. Recent research has introduced Generalist Value Models (such as $V_{0}$), which achieve pre-trained value estimation by explicitly encoding model capabilities in-context, eliminating the need to synchronously update the value model alongside the policy model. In this paper, we propose $V_{0.5}$, which adaptively fuses the baseline predicted by such a value model (acting as a prior) with the empirical mean derived from sparse rollouts. This constructs a robust baseline that balances computational efficiency with extremely low variance. Specifically, we introduce real-time statistical testing and dynamic budget allocation. This balances the high variance caused by sparse sampling against the…
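To make the idea of fusing a prior with a sparse empirical mean concrete, here is a minimal Python sketch. It is not the paper's algorithm: the abstract does not give the fusion rule or the test it uses, so the pseudo-count shrinkage weight, the z-test, and the names `fused_baseline`, `prior_strength`, and `z_threshold` are all assumptions chosen for illustration. It also assumes binary verifiable rewards.

```python
import math


def fused_baseline(v_prior, rewards, prior_strength=4.0, z_threshold=2.0):
    """Blend a prior value estimate with the empirical mean of binary rewards.

    Illustrative assumptions (not taken from the paper):
    - the prior counts as `prior_strength` pseudo-rollouts (shrinkage weight);
    - a z-test checks whether the rollouts are consistent with the prior.
    """
    k = len(rewards)
    if k == 0:
        # No rollouts available: the prior is the only information.
        return v_prior

    mean = sum(rewards) / k

    # Standard error of the mean if rewards were Bernoulli(v_prior).
    p = min(max(v_prior, 1e-6), 1.0 - 1e-6)
    se = math.sqrt(p * (1.0 - p) / k)
    z = abs(mean - p) / se

    if z > z_threshold:
        # Rollouts contradict the prior: fall back to the empirical mean.
        return mean

    # Otherwise shrink the empirical mean toward the prior.
    w = prior_strength / (prior_strength + k)
    return w * v_prior + (1.0 - w) * mean


def advantages(v_prior, rewards, **kwargs):
    """Per-rollout advantage: reward minus the fused baseline."""
    baseline = fused_baseline(v_prior, rewards, **kwargs)
    return [r - baseline for r in rewards]
```

With four rollouts and the default `prior_strength=4.0`, the prior and the empirical mean get equal weight when they agree statistically; a prior of 0.1 against four successes fails the test, and the baseline falls back to the empirical mean. A dynamic budget scheme could request more rollouts in that second case, but that step is not sketched here.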

Taxonomy

Topics: Reinforcement Learning in Robotics · Explainable Artificial Intelligence (XAI) · Domain Adaptation and Few-Shot Learning