Calibration-Aware Policy Optimization for Reasoning LLMs
Ziqi Wang, Xingzhou Lou, Meiqi Wu, Zhengqi Wen, Junge Zhang

TL;DR
This paper introduces CAPO, a calibration-aware policy optimization method for reasoning LLMs that improves calibration without sacrificing accuracy, addressing overconfidence issues in existing approaches.
Contribution
It proposes a novel uncertainty-aware advantage estimation technique with a logistic AUC surrogate loss and noise masking, enhancing calibration and accuracy simultaneously.
Findings
CAPO improves calibration by up to 15% on reasoning benchmarks.
CAPO maintains or improves reasoning accuracy compared to GRPO.
CAPO achieves a Pareto-optimal trade-off in precision and coverage when abstaining.
Abstract
Group Relative Policy Optimization (GRPO) enhances LLM reasoning but often induces overconfidence, where incorrect responses yield lower perplexity than correct ones, degrading relative calibration as measured by the Area Under the Curve (AUC). Existing approaches either yield limited improvements in calibration or sacrifice gains in reasoning accuracy. We first prove that this degradation in GRPO-style algorithms stems from their uncertainty-agnostic advantage estimation, which inevitably misaligns optimization gradients with calibration. This leads to improved accuracy at the expense of degraded calibration. We then propose Calibration-Aware Policy Optimization (CAPO). It adopts a logistic AUC surrogate loss that is theoretically consistent and admits a regret bound, enabling uncertainty-aware advantage estimation. By further incorporating a noise masking mechanism, CAPO achieves…
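To make the notion of relative calibration concrete, the sketch below computes a pairwise AUC over a group of sampled responses and a standard pairwise logistic surrogate for it. It is an illustration only, not the paper's implementation: the function names, the use of a generic per-response confidence score (for example, negative log-perplexity), and the exact surrogate form are assumptions, and the way CAPO turns such a loss into advantages, as well as its noise masking, is not shown.

```python
import math


def _split(scores, labels):
    """Separate confidence scores of correct (1) and incorrect (0) responses."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    return pos, neg


def pairwise_auc(scores, labels):
    """Fraction of (correct, incorrect) pairs in which the correct response
    has the higher confidence score; ties count as half a win."""
    pos, neg = _split(scores, labels)
    if not pos or not neg:
        return float("nan")  # AUC is undefined without both classes
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


def logistic_auc_surrogate(scores, labels):
    """Mean of log(1 + exp(-(s_correct - s_incorrect))) over all pairs.
    A smooth upper-bound-style stand-in for 1 - AUC (assumed form)."""
    pos, neg = _split(scores, labels)
    if not pos or not neg:
        return float("nan")
    total = sum(math.log1p(math.exp(-(p - n))) for p in pos for n in neg)
    return total / (len(pos) * len(neg))


# Hypothetical group of four responses: confidence scores and correctness.
scores = [-0.2, -0.9, -0.4, -0.3]
labels = [1, 1, 0, 0]
auc = pairwise_auc(scores, labels)
loss = logistic_auc_surrogate(scores, labels)
```

In this toy group one correct response is ranked above both incorrect ones and the other is ranked below both, so the ranking is no better than chance. The surrogate falls as correct responses move above incorrect ones in confidence, which is the direction an uncertainty-aware objective would push.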
