Calibration-Aware Policy Optimization for Reasoning LLMs

Ziqi Wang; Xingzhou Lou; Meiqi Wu; Zhengqi Wen; Junge Zhang

arXiv:2604.12632·cs.LG·April 15, 2026

TL;DR

This paper introduces CAPO, a calibration-aware policy optimization method for reasoning LLMs that improves calibration without sacrificing accuracy, addressing overconfidence issues in existing approaches.

Contribution

It proposes a novel uncertainty-aware advantage estimation technique with a logistic AUC surrogate loss and noise masking, enhancing calibration and accuracy simultaneously.

Findings

1. CAPO improves calibration by up to 15% on reasoning benchmarks.

2. CAPO maintains or improves reasoning accuracy compared to GRPO.

3. CAPO achieves a Pareto-optimal trade-off in precision and coverage when abstaining.
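The precision and coverage trade-off under abstention can be illustrated with a minimal sketch. It assumes the model answers only when a scalar confidence score meets a threshold; the function name `precision_coverage`, the threshold rule, and the toy scores are illustrative assumptions, not the paper's evaluation protocol.

```python
def precision_coverage(confidences, correct, threshold):
    """Return (precision, coverage) when the model abstains below `threshold`.

    confidences: one confidence score per response (higher = more confident)
    correct:     one bool per response, True if the answer is right
    Hypothetical helper for illustration only.
    """
    # Keep only the responses the model chooses to answer.
    answered = [ok for conf, ok in zip(confidences, correct) if conf >= threshold]
    coverage = len(answered) / len(confidences)
    # Precision is undefined when the model abstains on everything.
    precision = sum(answered) / len(answered) if answered else float("nan")
    return precision, coverage


# Toy data (made up): four responses with confidence scores and correctness.
confidences = [0.9, 0.8, 0.4, 0.3]
correct = [True, True, False, True]

strict = precision_coverage(confidences, correct, threshold=0.5)
lenient = precision_coverage(confidences, correct, threshold=0.0)
```

Sweeping the threshold traces out a precision-coverage curve; a better-calibrated model should lose less precision as coverage grows.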

Abstract

Group Relative Policy Optimization (GRPO) enhances LLM reasoning but often induces overconfidence, where incorrect responses yield lower perplexity than correct ones, degrading relative calibration as measured by the Area Under the Curve (AUC). Existing approaches either yield limited improvements in calibration or sacrifice gains in reasoning accuracy. We first prove that this degradation in GRPO-style algorithms stems from their uncertainty-agnostic advantage estimation, which inevitably misaligns optimization gradients with calibration. This leads to improved accuracy at the expense of degraded calibration. We then propose Calibration-Aware Policy Optimization (CAPO). It adopts a logistic AUC surrogate loss that is theoretically consistent and admits a regret bound, enabling uncertainty-aware advantage estimation. By further incorporating a noise masking mechanism, CAPO achieves…
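The quantities named in the abstract can be made concrete with a rough sketch. It assumes confidence is the mean token log-probability of a response (the negative log of its perplexity), uses the commonly cited GRPO normalization (reward minus group mean, divided by group standard deviation), and uses a generic pairwise logistic loss as the AUC surrogate. All function names and toy values are hypothetical, and this is not CAPO's actual advantage estimator or noise masking, which the excerpt does not specify.

```python
import math


def pairwise_auc(conf_correct, conf_incorrect):
    """Fraction of (correct, incorrect) pairs ranked properly; ties count 0.5."""
    total = 0.0
    for c in conf_correct:
        for w in conf_incorrect:
            if c > w:
                total += 1.0
            elif c == w:
                total += 0.5
    return total / (len(conf_correct) * len(conf_incorrect))


def logistic_auc_surrogate(conf_correct, conf_incorrect):
    """Mean of log(1 + exp(-(c - w))) over all pairs: a smooth stand-in for 1 - AUC."""
    losses = []
    for c in conf_correct:
        for w in conf_incorrect:
            margin = c - w
            # Numerically stable softplus(-margin).
            losses.append(max(-margin, 0.0) + math.log1p(math.exp(-abs(margin))))
    return sum(losses) / len(losses)


def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style advantage: depends only on rewards, never on confidence."""
    mean = sum(rewards) / len(rewards)
    std = math.sqrt(sum((r - mean) ** 2 for r in rewards) / len(rewards))
    return [(r - mean) / (std + eps) for r in rewards]


# Toy group (made up): confidences are mean token log-probabilities.
conf_correct = [-0.2, -0.5]
conf_incorrect = [-0.1, -0.9]  # the -0.1 response is wrong but most confident

auc = pairwise_auc(conf_correct, conf_incorrect)
loss = logistic_auc_surrogate(conf_correct, conf_incorrect)
adv = group_relative_advantages([1.0, 0.0, 1.0, 0.0])
```

In this toy group both correct responses receive the same advantage whatever their confidence, which is the sense in which a reward-only advantage is uncertainty-agnostic; the surrogate, by contrast, penalizes every pair in which an incorrect response outranks a correct one.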
