Process Supervision of Confidence Margin for Calibrated LLM Reasoning

Liaoyaqi Wang; Chunsheng Zuo; William Jurayj; Benjamin Van Durme; Anqi Liu

arXiv:2604.23333·cs.LG·April 28, 2026

TL;DR

This paper introduces RLCM, a reinforcement learning framework that improves large language model calibration by widening the confidence margin between correct and incorrect reasoning steps, leading to more reliable reasoning and more efficient risk control.

Contribution

RLCM is a calibration-aware RL method that improves confidence reliability while maintaining or improving accuracy in LLM reasoning across mathematical, code, logic, and science benchmarks.

Findings

01

RLCM substantially improves model calibration across mathematical, code, logic, and science benchmarks.

02

Models with RLCM enable more effective conformal risk control.

03

RLCM maintains or enhances reasoning accuracy while calibrating confidence.

Abstract

Scaling test-time computation with reinforcement learning (RL) has emerged as a reliable path to improving the reasoning ability of large language models (LLMs). Yet outcome-based rewards often incentivize models to be overconfident, leading to hallucinations, unreliable confidence-based control, and unnecessary compute allocation. We introduce Reinforcement Learning with Confidence Margin (RLCM), a calibration-aware RL framework that jointly optimizes correctness and confidence reliability via a margin-enhanced process reward over intermediate-budget completions. Rather than aligning confidence to correctness likelihoods, RLCM encourages the model to widen the confidence margin between correct and incorrect steps within a single reasoning trajectory. Across mathematical, code, logic, and science benchmarks, our method substantially improves calibration while maintaining or improving accuracy. We…
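
To make the idea of a confidence margin within a single trajectory concrete, here is a minimal Python sketch. It is not the paper's reward: the abstract does not give the formula, so the function name `confidence_margin`, the use of mean confidences, and the zero fallback are all assumptions made for illustration. It takes one confidence value per intermediate step plus a flag for whether that step's completion was correct, and returns the gap between the two groups.

```python
from statistics import mean


def confidence_margin(confidences, correct):
    """Illustrative margin (an assumption, not the RLCM reward):
    mean confidence on correct steps minus mean confidence on incorrect steps.
    """
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")

    on_correct = [c for c, ok in zip(confidences, correct) if ok]
    on_incorrect = [c for c, ok in zip(confidences, correct) if not ok]

    # Hypothetical convention: with only one kind of step there is
    # nothing to separate, so report a margin of zero.
    if not on_correct or not on_incorrect:
        return 0.0

    return mean(on_correct) - mean(on_incorrect)


# One trajectory with four intermediate steps (made-up values).
step_confidences = [0.9, 0.8, 0.3, 0.2]
step_is_correct = [True, True, False, False]
margin = confidence_margin(step_confidences, step_is_correct)
```

Under this toy definition, a larger positive margin means the model is more confident on steps that turn out correct than on steps that do not, and a negative margin means confidence is inverted for that trajectory.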
