Decoupling Reasoning and Confidence: Resurrecting Calibration in Reinforcement Learning from Verifiable Rewards

Zhengzhao Ma; Xueru Wen; Boxi Cao; Yaojie Lu; Hongyu Lin; Jinglin Yang; Min He; Xianpei Han; Le Sun

arXiv:2603.09117·cs.LG·May 1, 2026

TL;DR

This paper introduces DCPO, a framework that decouples reasoning and calibration in reinforcement learning for large language models, improving calibration without sacrificing accuracy.

Contribution

The paper identifies a gradient conflict in existing methods, between maximizing policy accuracy and minimizing calibration error, and proposes decoupling the two objectives so that calibration improves while accuracy is maintained.

Findings

01 DCPO achieves the best calibration performance in the paper's experiments.

02 DCPO preserves accuracy on par with GRPO.

03 DCPO substantially mitigates over-confidence in LLMs.

Abstract

Reinforcement Learning from Verifiable Rewards (RLVR) significantly enhances the reasoning of large language models (LLMs) but suffers from severe calibration degeneration, in which models become excessively over-confident in incorrect answers. Previous studies focus on directly incorporating a calibration objective into the existing optimization target. However, our theoretical analysis demonstrates that there is a fundamental gradient conflict between maximizing policy accuracy and minimizing calibration error. Building on this insight, we propose DCPO, a simple yet effective framework that systematically decouples the reasoning and calibration objectives. Extensive experiments demonstrate that DCPO not only preserves accuracy on par with GRPO but also achieves the best calibration performance and substantially mitigates the over-confidence issue. Our study provides…
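
To make "calibration error" and "over-confidence" concrete, here is a minimal sketch of binned Expected Calibration Error (ECE), a commonly used calibration metric. The abstract does not say which metric the paper reports, so the choice of ECE is an assumption, and the function name `expected_calibration_error`, the bin count, and the sample values are purely illustrative rather than taken from the DCPO code.

```python
from typing import Sequence


def expected_calibration_error(
    confidences: Sequence[float],
    correct: Sequence[bool],
    n_bins: int = 10,
) -> float:
    """Binned ECE: weighted mean of |accuracy - mean confidence| per bin.

    Illustrative helper (assumption): not taken from the paper or its repo.
    """
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    if len(confidences) == 0:
        raise ValueError("need at least one prediction")
    if any(c < 0.0 or c > 1.0 for c in confidences):
        raise ValueError("confidences must lie in [0, 1]")

    # Each bin accumulates [sum of confidences, number correct, count].
    bins = [[0.0, 0, 0] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        # A confidence of exactly 1.0 is placed in the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx][0] += conf
        bins[idx][1] += int(ok)
        bins[idx][2] += 1

    total = len(confidences)
    ece = 0.0
    for conf_sum, n_correct, count in bins:
        if count == 0:
            continue  # empty bins contribute nothing
        avg_conf = conf_sum / count
        accuracy = n_correct / count
        ece += (count / total) * abs(accuracy - avg_conf)
    return ece


# Hypothetical over-confident model: states 0.9 confidence, right 1 time in 4.
overconfident = expected_calibration_error(
    [0.9, 0.9, 0.9, 0.9], [True, False, False, False]
)
# Hypothetical well-calibrated model: states 0.5 confidence, right half the time.
calibrated = expected_calibration_error([0.5, 0.5], [True, False])
```

In this toy data the over-confident model has a large gap between stated confidence and accuracy, while the second model has none; this is the kind of gap the abstract describes RLVR as widening.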

Code & Models

Repositories

icip-cas/DCPO (GitHub)
