Shrinking the Variance: Shrinkage Baselines for Reinforcement Learning with Verifiable Rewards
Guanning Zeng, Zhaoyi Zhou, Daman Arora, Andrea Zanette

TL;DR
This paper introduces shrinkage-based baselines for reinforcement learning with verifiable rewards (RLVR). The baselines reduce the variance of policy-gradient estimates and improve training stability without adding hyperparameters.
Contribution
It proposes a shrinkage estimator for the per-prompt baseline that provably lowers the variance of the policy-gradient estimator in RLVR, supported by theoretical analysis and by experiments showing more stable training.
Findings
Shrinkage baselines yield lower-variance policy-gradient estimates than per-prompt empirical-mean baselines.
The proposed baseline is a drop-in replacement requiring no extra hyperparameters.
Empirical results show improved training stability and lower-variance updates.
Abstract
Reinforcement Learning with Verifiable Rewards (RLVR) has emerged as a powerful paradigm for post-training large reasoning models (LRMs) using policy-gradient methods such as GRPO. To stabilize training, these methods typically center trajectory rewards by subtracting the empirical mean reward for each prompt. Statistically, this centering acts as a control variate (baseline), reducing the variance of the policy-gradient estimator. In practice, the mean reward is estimated using per-prompt empirical averages computed from the generations for each prompt in a batch. Motivated by Stein's paradox, we propose shrinkage estimators that combine per-prompt and across-prompt means to improve per-prompt mean estimation accuracy, especially in the low-generation regime typical of RLVR. Theoretically, we construct a shrinkage-based baseline that provably yields lower-variance policy-gradient…
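To make the idea of combining per-prompt and across-prompt means concrete, here is a minimal sketch of a positive-part James-Stein-style baseline. It is not the paper's estimator, whose exact form is not given in this summary. The function name `shrinkage_baselines`, the pooled within-prompt variance estimate, and the `(num_prompts - 3)` factor are illustrative assumptions taken from the textbook James-Stein construction. The sketch also ignores the dependence between a baseline and the rewards it is built from, which a careful implementation would need to handle to keep the gradient estimate unbiased.

```python
import numpy as np


def shrinkage_baselines(rewards, eps=1e-8):
    """Hypothetical helper: shrink per-prompt mean rewards toward the batch mean.

    rewards: array of shape (num_prompts, num_generations), num_generations >= 2.
    Returns (baselines, lam): baselines has shape (num_prompts,), and lam in
    [0, 1] is the weight placed on the across-prompt (grand) mean.
    """
    rewards = np.asarray(rewards, dtype=float)
    num_prompts, num_generations = rewards.shape

    # Per-prompt empirical means (the usual centering baseline) and their average.
    prompt_means = rewards.mean(axis=1)
    grand_mean = prompt_means.mean()

    # Assumed noise model: every prompt mean has the same sampling variance,
    # estimated from the pooled within-prompt sample variance.
    within_var = rewards.var(axis=1, ddof=1).mean() / num_generations

    # Observed spread of the prompt means around the grand mean.
    between = np.sum((prompt_means - grand_mean) ** 2)

    # Positive-part James-Stein-style weight, clipped to [0, 1].
    lam = float(np.clip((num_prompts - 3) * within_var / (between + eps), 0.0, 1.0))

    baselines = (1.0 - lam) * prompt_means + lam * grand_mean
    return baselines, lam


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Toy batch: 8 prompts, 4 generations each, binary verifiable rewards.
    success_prob = rng.uniform(0.2, 0.8, size=(8, 1))
    rewards = (rng.random((8, 4)) < success_prob).astype(float)

    baselines, lam = shrinkage_baselines(rewards)
    advantages = rewards - baselines[:, None]  # centered rewards per generation
    print("shrinkage weight:", lam)
    print("advantages shape:", advantages.shape)
```

With few generations per prompt, the per-prompt means are noisy and the weight `lam` moves toward 1, so the baseline leans on the batch-level mean. When the within-prompt variance is small relative to the spread between prompts, `lam` moves toward 0 and the sketch falls back to the ordinary per-prompt mean.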
Taxonomy
Topics: Reinforcement Learning in Robotics · Domain Adaptation and Few-Shot Learning · Explainable Artificial Intelligence (XAI)
