Why Semantic Entropy Fails: Geometry-Aware and Calibrated Uncertainty for Policy Optimization
Zheyuan Zhang, Kaiwen Shi, Han Bao, Zehong Wang, Tianyi Ma, Yanfang Ye

TL;DR
This paper introduces GCPO, a geometry-aware and calibrated uncertainty framework for policy optimization that improves post-training performance by better characterizing gradient variance and learning signals.
Contribution
To the authors' knowledge, it provides the first principled formulation of uncertainty signals as regulators of gradient variance, and it identifies two gaps in entropy-based estimators (the anisotropic gap and the calibration gap) that motivate a geometry-aware and calibrated approach.
Findings
GCPO more accurately tracks gradient variability.
GCPO consistently improves post-training performance.
The approach offers a principled perspective for robust post-training.
Abstract
Post-training has become central to improving reasoning and alignment in large language models, where critic-free models enable scalable learning from model-generated outputs but lack principled mechanisms to distinguish informative from noisy signals. Recent approaches leverage response-level measures as uncertainty signals to regulate group-based optimization methods such as GRPO. Yet their empirical success remains unstable, and how they influence optimization dynamics is unclear. In this paper, we provide, to our knowledge, the first principled formulation that interprets uncertainty signals as mechanisms for characterizing and regulating gradient variance and learning signal quality. Based on both empirical and theoretical analysis, we identify two critical gaps in current entropy-based estimators: the anisotropic gap and the calibration gap. Motivated by this analysis, we propose…
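For readers unfamiliar with the background the abstract refers to, the following is a minimal sketch of two standard ingredients: the group-relative advantage used by GRPO-style methods, and a semantic-entropy-style uncertainty signal computed over clusters of sampled responses. It is not GCPO and not the authors' code. The function names, the toy rewards, and the cluster labels are hypothetical, and the clustering step itself (grouping responses by meaning) is assumed to have been done elsewhere.

```python
import math
from collections import Counter


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward by the mean and std of its sampling group."""
    n = len(rewards)
    mean = sum(rewards) / n
    var = sum((r - mean) ** 2 for r in rewards) / n  # population variance
    std = math.sqrt(var)
    return [(r - mean) / (std + eps) for r in rewards]


def cluster_entropy(cluster_ids):
    """Shannon entropy (in nats) of the empirical distribution over clusters.

    Each entry of cluster_ids is the (hypothetical) semantic cluster assigned
    to one sampled response for the same prompt.
    """
    counts = Counter(cluster_ids)
    total = len(cluster_ids)
    return -sum((c / total) * math.log(c / total) for c in counts.values())


# Toy group of four sampled responses for one prompt (illustrative values).
rewards = [1.0, 0.0, 1.0, 0.0]
clusters = ["a", "b", "a", "b"]

advantages = group_relative_advantages(rewards)
entropy = cluster_entropy(clusters)
```

How such an uncertainty signal is then used to reweight or filter the group update differs between methods and is not specified in the text above, so the sketch stops at computing the two quantities.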
