Process Supervision of Confidence Margin for Calibrated LLM Reasoning
Liaoyaqi Wang, Chunsheng Zuo, William Jurayj, Benjamin Van Durme, Anqi Liu

TL;DR
This paper introduces RLCM, a reinforcement learning framework that improves large language model calibration by widening the confidence margin between correct and incorrect reasoning steps, leading to more reliable reasoning and more efficient risk control.
Contribution
RLCM is a calibration-aware RL method that jointly optimizes correctness and confidence reliability through a margin-enhanced process reward, evaluated on mathematical, code, logic, and science benchmarks.
Findings
RLCM substantially improves model calibration across mathematical, code, logic, and science benchmarks.
Models with RLCM enable more effective conformal risk control.
RLCM maintains or enhances reasoning accuracy while calibrating confidence.
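The calibration claims above can be made concrete with a standard metric. The summary does not say which calibration metric the paper reports, so the following is only an illustrative sketch of binned expected calibration error (ECE), a common choice; the function name, the bin count, and the example inputs are assumptions, not taken from the paper.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Binned ECE: weighted average of |accuracy - mean confidence| per bin.

    Illustrative only; the paper's own metric is not stated in this summary.
    confidences: floats in [0, 1]; correct: booleans of the same length.
    """
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    n = len(confidences)
    if n == 0:
        return 0.0

    # Assign each prediction to an equal-width confidence bin.
    bins = [[] for _ in range(n_bins)]
    for conf, is_correct in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)  # conf == 1.0 goes in the last bin
        bins[idx].append((conf, is_correct))

    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(1 for _, y in members if y) / len(members)
        ece += (len(members) / n) * abs(accuracy - mean_conf)
    return ece


# Hypothetical data: an overconfident model versus a better-calibrated one.
overconfident = expected_calibration_error([0.9, 0.9, 0.9, 0.9], [True, False, False, False])
calibrated = expected_calibration_error([0.9, 0.9, 0.1, 0.1], [True, True, False, False])
```

A lower value means stated confidence tracks observed accuracy more closely, which is the property that confidence-based control depends on.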
Abstract
Scaling test-time computation with reinforcement learning (RL) has emerged as a reliable path to improving the reasoning ability of large language models (LLMs). Yet outcome-based rewards often incentivize models to be overconfident, leading to hallucinations, unreliable confidence-based control, and unnecessary compute allocation. We introduce Reinforcement Learning with Confidence Margin (RLCM), a calibration-aware RL framework that jointly optimizes correctness and confidence reliability via a margin-enhanced process reward over intermediate-budget completions. Rather than aligning confidence to correctness likelihoods, RLCM encourages the model to widen the confidence margin between correct and incorrect steps within a single reasoning trajectory. Across mathematical, code, logic, and science benchmarks, our method substantially improves calibration while maintaining or improving accuracy. We…
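The abstract does not give the reward formula, so the sketch below is not the authors' method. It only illustrates the general idea of a confidence margin within one trajectory, under stated assumptions: each intermediate-budget completion yields a confidence and a correctness label, the margin is the mean confidence on correct steps minus the mean confidence on incorrect steps, and a hypothetical `margin_weight` combines it with an outcome reward. All names here are illustrative.

```python
def confidence_margin(confidences, correct):
    """Mean confidence on correct steps minus mean confidence on incorrect steps.

    Assumed definition for illustration; returns 0.0 when the trajectory
    has no correct steps or no incorrect steps, since no margin exists.
    """
    on_correct = [c for c, y in zip(confidences, correct) if y]
    on_incorrect = [c for c, y in zip(confidences, correct) if not y]
    if not on_correct or not on_incorrect:
        return 0.0
    return sum(on_correct) / len(on_correct) - sum(on_incorrect) / len(on_incorrect)


def margin_reward(confidences, correct, final_correct, margin_weight=0.5):
    """Outcome reward plus a weighted margin term (hypothetical combination)."""
    return float(final_correct) + margin_weight * confidence_margin(confidences, correct)


# Hypothetical trajectory with three intermediate-budget completions:
# the first two are wrong, the last is right.
step_correct = [False, False, True]
separated = margin_reward([0.2, 0.4, 0.9], step_correct, final_correct=True)
uniformly_confident = margin_reward([0.9, 0.9, 0.9], step_correct, final_correct=True)
```

Under these assumptions, a trajectory that is uniformly confident earns no margin bonus even when its final answer is right, while one that is less confident on its incorrect intermediate completions is rewarded more.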
