Closing the Confidence-Faithfulness Gap in Large Language Models
Miranda Muqing Miao, Lyle Ungar

TL;DR
This paper shows that the confidence large language models verbalize is linearly encoded in their activations yet often misaligned with actual accuracy, and proposes a steering method to improve calibration.
Contribution
The paper shows that calibration and verbalized confidence signals are linearly encoded, identifies the Reasoning Contamination Effect, and presents a two-stage adaptive steering pipeline to improve confidence calibration.
Findings
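Calibration and verbalized confidence signals are each linearly encoded but orthogonal to one another, a result consistent across three open-weight models and four datasets.
Prompting a model to reason through a problem and verbalize a confidence score at the same time disrupts the verbalized confidence direction and worsens miscalibration (the "Reasoning Contamination Effect").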
The proposed steering pipeline significantly improves calibration across models.
Abstract
Large language models (LLMs) tend to verbalize confidence scores that are largely detached from their actual accuracy, yet the geometric relationship governing this behavior remains poorly understood. In this work, we present a mechanistic interpretability analysis of verbalized confidence, using linear probes and contrastive activation addition (CAA) steering to show that calibration and verbalized confidence signals are encoded linearly but are orthogonal to one another -- a finding consistent across three open-weight models and four datasets. Interestingly, when models are prompted to simultaneously reason through a problem and verbalize a confidence score, the reasoning process disrupts the verbalized confidence direction, exacerbating miscalibration. We term this the "Reasoning Contamination Effect." Leveraging this insight, we introduce a two-stage adaptive steering pipeline that…
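The geometric claim in the abstract (two linearly encoded signals that are orthogonal to one another) can be illustrated with a minimal sketch. It assumes synthetic activations rather than real model hidden states, and it assumes a CAA direction is computed as the normalized difference of class means, which is the usual formulation but is not confirmed by the text above. The function names (`caa_direction`, `cosine`, `steer`), the planted axes, and all numeric settings are hypothetical and chosen only for illustration; this is not the authors' pipeline.

```python
import numpy as np


def caa_direction(pos_acts, neg_acts):
    """Unit-norm difference of class means (assumed CAA-style direction)."""
    diff = pos_acts.mean(axis=0) - neg_acts.mean(axis=0)
    return diff / np.linalg.norm(diff)


def cosine(u, v):
    """Cosine similarity between two vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))


def steer(hidden, direction, alpha):
    """Add alpha times a direction to every activation row."""
    return hidden + alpha * direction


rng = np.random.default_rng(0)
d, n = 64, 500  # hypothetical hidden size and examples per class

# Planted, mutually orthogonal axes standing in for the two signals.
conf_axis = np.zeros(d)
conf_axis[0] = 1.0
calib_axis = np.zeros(d)
calib_axis[1] = 1.0


def synthetic(axis, sign):
    """Gaussian noise shifted along one planted axis."""
    return rng.normal(scale=0.5, size=(n, d)) + sign * 2.0 * axis


high_conf, low_conf = synthetic(conf_axis, +1), synthetic(conf_axis, -1)
well_cal, poorly_cal = synthetic(calib_axis, +1), synthetic(calib_axis, -1)

conf_dir = caa_direction(high_conf, low_conf)
calib_dir = caa_direction(well_cal, poorly_cal)

# Near-zero cosine similarity indicates the recovered directions are
# close to orthogonal, mirroring the planted structure.
overlap = cosine(conf_dir, calib_dir)

# Steering along the confidence direction shifts the mean projection
# onto that direction by alpha, because the direction has unit norm.
alpha = 4.0
steered = steer(low_conf, conf_dir, alpha)
shift = float((steered @ conf_dir).mean() - (low_conf @ conf_dir).mean())

print(f"cosine(conf_dir, calib_dir) = {overlap:.4f}")
print(f"mean projection shift after steering = {shift:.4f}")
```

With real models, the activations would come from a chosen layer's residual stream for contrastive prompt pairs, and the cosine similarity between the two recovered directions would be the quantity of interest; the synthetic data here only shows the mechanics of the computation.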
