Shrinking the Variance: Shrinkage Baselines for Reinforcement Learning with Verifiable Rewards

Guanning Zeng; Zhaoyi Zhou; Daman Arora; Andrea Zanette

arXiv:2511.03710 · cs.LG · February 19, 2026

TL;DR

This paper introduces shrinkage-based baselines for reinforcement learning with verifiable rewards, reducing variance in policy-gradient estimates and improving training stability without extra hyperparameters.

Contribution

The paper proposes a shrinkage estimator for the per-prompt baseline that provably lowers policy-gradient variance and improves training stability in RLVR, supported by both theoretical analysis and empirical results.

Findings

01. Shrinkage baselines outperform empirical mean baselines in variance reduction.

02. The proposed baseline is a drop-in replacement requiring no extra hyperparameters.

03. Empirical results show improved training stability and lower-variance updates.

Abstract

Reinforcement Learning with Verifiable Rewards (RLVR) has emerged as a powerful paradigm for post-training large reasoning models (LRMs) using policy-gradient methods such as GRPO. To stabilize training, these methods typically center trajectory rewards by subtracting the empirical mean reward for each prompt. Statistically, this centering acts as a control variate (baseline), reducing the variance of the policy-gradient estimator. In practice, the mean reward is estimated using per-prompt empirical averages computed from the generations for each prompt in a batch. Motivated by Stein's paradox, we propose shrinkage estimators that combine per-prompt and across-prompt means to improve per-prompt mean estimation accuracy, especially in the low-generation regime typical of RLVR. Theoretically, we construct a shrinkage-based baseline that provably yields lower-variance policy-gradient…
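
To make the idea in the abstract concrete, here is a minimal sketch of a baseline that blends each prompt's empirical mean reward with the across-prompt mean. The abstract as shown here does not give the paper's exact estimator, so the shrinkage weight below is an assumption: a generic plug-in (method-of-moments) weight, clipped to [0, 1]. The function names `shrinkage_baselines` and `centered_rewards` are hypothetical and chosen only for illustration.

```python
from statistics import fmean, variance


def shrinkage_baselines(rewards_per_prompt):
    """Return one baseline per prompt, shrunk toward the across-prompt mean.

    rewards_per_prompt: list of lists, one inner list of rewards per prompt
    (each prompt needs at least two generations).

    ASSUMPTION: the weight `lam` is a generic plug-in shrinkage weight,
    not necessarily the estimator derived in the paper.
    """
    prompt_means = [fmean(r) for r in rewards_per_prompt]
    grand_mean = fmean(prompt_means)
    n_prompts = len(prompt_means)

    # Average estimated sampling variance of a per-prompt mean (s^2 / n).
    noise = fmean(variance(r) / len(r) for r in rewards_per_prompt)

    # Observed spread of the per-prompt means around the across-prompt mean.
    spread = sum((m - grand_mean) ** 2 for m in prompt_means) / max(n_prompts - 1, 1)

    # Shrink fully when the means are indistinguishable; otherwise shrink
    # in proportion to how much of the spread is explained by noise.
    lam = 1.0 if spread == 0 else min(1.0, noise / spread)

    return [lam * grand_mean + (1.0 - lam) * m for m in prompt_means]


def centered_rewards(rewards_per_prompt):
    """Subtract each prompt's shrunk baseline from its rewards."""
    baselines = shrinkage_baselines(rewards_per_prompt)
    return [[r - b for r in rs] for rs, b in zip(rewards_per_prompt, baselines)]
```

With few generations per prompt, the per-prompt means are noisy, so `lam` moves toward 1 and the baseline leans on the across-prompt mean. With many generations, `lam` moves toward 0 and the sketch falls back to the usual per-prompt empirical mean.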

Taxonomy

Topics: Reinforcement Learning in Robotics · Domain Adaptation and Few-Shot Learning · Explainable Artificial Intelligence (XAI)