Uncalibrated Reasoning: GRPO Induces Overconfidence for Stochastic Outcomes
Michael Bereket, Jure Leskovec

TL;DR
This paper examines whether reinforcement learning methods for language models remain effective in domains with stochastic outcomes. It finds that Group Relative Policy Optimization (GRPO) induces overconfident predictions, whereas PPO and RLOO yield well-calibrated models.
Contribution
The study identifies group standard normalization in GRPO as the cause of overconfidence on stochastic outcomes and offers a theoretical explanation for the effect.
Findings
GRPO induces overconfidence in stochastic outcomes
Removing normalization in GRPO improves calibration
PPO and RLOO produce well-calibrated models in stochastic domains
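The estimators named in these findings differ mainly in how they turn a group of sampled rewards into advantages. The following is a minimal sketch of the commonly used forms, not the paper's code: the function names and the `eps` constant are assumptions made for illustration, and PPO is omitted because it relies on a learned value function rather than group statistics.

```python
import numpy as np

def grpo_advantages(rewards, eps=1e-6):
    """GRPO-style advantages: center by the group mean, divide by the group std."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + eps)  # eps is an assumed stabilizer

def grpo_no_std_advantages(rewards):
    """GRPO with group standard normalization removed: center only."""
    r = np.asarray(rewards, dtype=float)
    return r - r.mean()

def rloo_advantages(rewards):
    """RLOO: each sample's baseline is the mean reward of the other samples."""
    r = np.asarray(rewards, dtype=float)
    k = r.size
    leave_one_out_mean = (r.sum() - r) / (k - 1)
    return r - leave_one_out_mean

# Two groups whose rewards differ in spread by a factor of ten.
narrow = [-0.1, -0.2]
wide = [-1.0, -2.0]
print(grpo_advantages(narrow), grpo_advantages(wide))
print(grpo_no_std_advantages(narrow), grpo_no_std_advantages(wide))
```

In this sketch the normalized variant maps both groups to nearly identical advantages, while the centered-only variant preserves the tenfold difference in reward scale. That loss of scale information is the design choice the findings call into question.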
Abstract
Reinforcement learning (RL) has proven remarkably effective at improving the accuracy of language models in verifiable and deterministic domains like mathematics. Here, we examine if current RL methods are also effective at optimizing language models in verifiable domains with stochastic outcomes, like scientific experiments. Through applications to synthetic data and real-world biological experiments, we demonstrate that Group Relative Policy Optimization (GRPO) induces overconfident probability predictions for binary stochastic outcomes, while Proximal Policy Optimization (PPO) and REINFORCE Leave-One-Out (RLOO) yield well-calibrated models. We show that removing group standard normalization in GRPO fixes its miscalibration and provide a theoretical explanation for why normalization causes overconfidence. Our results provide new evidence against the use of standard normalization in…
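One way to build intuition for why dividing by the group standard deviation could push probability predictions toward the extremes is a two-sample toy calculation. It is a sketch under stated assumptions and not the paper's derivation: the reward is assumed to be the log-likelihood of the observed binary outcome, each group is assumed to contain exactly two guesses `q - d` and `q + d` scored against the same sampled outcome, and all names are hypothetical.

```python
import math

def log_score(q, y):
    """Assumed reward: log-likelihood of binary outcome y under predicted probability q."""
    return math.log(q) if y == 1 else math.log(1.0 - q)

def expected_advantage_of_higher_guess(p, q, d=0.05, normalize=True):
    """Expected advantage of the guess q + d over outcomes y ~ Bernoulli(p).

    The group holds two guesses, q - d and q + d, both scored on the same y.
    A positive value means the update favors raising the predicted probability.
    """
    lo, hi = q - d, q + d
    total = 0.0
    for y, prob in ((1, p), (0, 1.0 - p)):
        r_lo, r_hi = log_score(lo, y), log_score(hi, y)
        adv_hi = r_hi - 0.5 * (r_lo + r_hi)   # center by the group mean
        if normalize:
            std = 0.5 * abs(r_hi - r_lo)      # population std of two values
            adv_hi /= std                     # becomes +1 or -1
        total += prob * adv_hi
    return total

# True outcome probability 0.7, current guess already too high at 0.9.
print(expected_advantage_of_higher_guess(0.7, 0.9, normalize=True))
print(expected_advantage_of_higher_guess(0.7, 0.9, normalize=False))
```

With normalization, the higher guess receives +1 whenever the outcome is 1 and -1 otherwise, so its expected advantage is 2p - 1 regardless of how far q already exceeds p. Without normalization, the large penalty under y = 0 is preserved, and in this toy setting the expected advantage of the higher guess turns negative once q is above p.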
Peer Reviews
Decision · Submitted to ICLR 2026
- The paper clearly identifies and isolates the role of group standard normalization in GRPO, providing both empirical and theoretical evidence that this design choice induces systematic overconfidence in stochastic decision settings.
- The experimental evaluation spans synthetic, biological, and clinical knowledge domains, demonstrating that the observed miscalibration behavior persists across qualitatively different tasks and data regimes.
- The theoretical explanation of how standard normal
- The empirical evaluation relies only on a single model (Qwen3-4B) across all experiments, which makes it difficult to determine whether the observed calibration differences generalize beyond this specific architecture and scale. Including additional models would substantially strengthen the empirical claims. Furthermore, some details of the experimental setup are under-specified in the main text.
- The model is required to generate explicit natural language tokens to represent probability, r
- This paper is well-written and easy to follow. The motivation is well-supported by evidence, including visualizations of the miscalibration and explanations.
- The experimental setup is clear. Although the experiments are not large-scale RL, they cover 3 different datasets and clearly demonstrate the overconfidence phenomenon across different stochastic settings.
While I appreciate that the authors provide a new perspective on the impact of group std normalization from the lens of uncertainty and overconfidence, the theoretical discussion itself does not bring substantial new insights. Similar ideas, namely that group std normalization can lead to overconfidence, and the corresponding solution of removing the term have also been discussed in [1]. Furthermore, the role of group std normalization has been extensively discussed in prior work, such as Dr. GR
1. Interesting empirical observation. The finding that GRPO causes overconfidence in stochastic prediction tasks is novel and may interest researchers exploring calibration or uncertainty in RL algorithms.
2. Practical implication. The fix, removing standard normalization, is simple and easy to test. If correct, it could inform best practices for future reasoning RL work.
3. Link theory and practice. The discussion about group normalization introducing a bias in the advantage estimate points tow
1. Limited originality and contribution. The main claim—removing normalization from GRPO improves behavior—closely parallels Dr. GRPO [1], which already proposed removing the same term. The paper essentially re-validates known ideas rather than introducing a distinct algorithmic or theoretical innovation. Calibration of GRPO is certainly a bug worth fixing, but the contribution is incremental—basically a note explaining why an already known fix works—rather than delivering a substantial new meth
Taxonomy
Topics: Complex Systems and Decision Making · Cognitive Science and Mapping
