Wired for Overconfidence: A Mechanistic Perspective on Inflated Verbalized Confidence in LLMs
Tianyi Zhao, Yinhan He, Wendy Zheng, Yujie Zhang, Chen Chen

TL;DR
This paper investigates the internal mechanisms behind overconfidence in large language models, identifying specific circuits responsible and demonstrating targeted interventions to improve calibration.
Contribution
The paper provides a circuit-level analysis that traces inflated verbalized confidence to specific internal components, and it proposes targeted inference-time interventions to mitigate overconfidence in LLMs.
Findings
Circuits in middle-to-late layers inflate verbalized confidence.
Targeted inference-time interventions on these circuits improve model calibration.
The confidence-inflation signal is concentrated in a compact set of MLP blocks and attention heads.
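As a rough illustration of what "improved calibration" means in practice, the sketch below computes Expected Calibration Error (ECE), a standard calibration metric. This listing does not say which metric the paper reports, so the choice of ECE, the function name, and the 10-bin default are all assumptions made for illustration.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Weighted average of |mean confidence - accuracy| over equal-width bins.

    confidences: floats in [0, 1], e.g. a verbalized "90%" becomes 0.9.
    correct:     0/1 flags marking whether each answer was right.
    NOTE: illustrative helper, not taken from the paper.
    """
    if len(confidences) != len(correct) or not confidences:
        raise ValueError("inputs must be non-empty and of equal length")
    if any(c < 0.0 or c > 1.0 for c in confidences):
        raise ValueError("confidences must lie in [0, 1]")

    n = len(confidences)
    bins = [[] for _ in range(n_bins)]
    for c, y in zip(confidences, correct):
        # A confidence of exactly 1.0 falls into the last bin.
        idx = min(int(c * n_bins), n_bins - 1)
        bins[idx].append((c, y))

    ece = 0.0
    for b in bins:
        if not b:
            continue
        mean_conf = sum(c for c, _ in b) / len(b)
        accuracy = sum(y for _, y in b) / len(b)
        ece += (len(b) / n) * abs(mean_conf - accuracy)
    return ece
```

A model that states 90% confidence on four answers but gets only one right has a large gap between stated confidence and accuracy, which is the "confidently wrong" pattern; a lower ECE after an intervention would indicate better calibration under this metric.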
Abstract
Large language models are often not just wrong, but confidently wrong: when they produce factually incorrect answers, they tend to verbalize overly high confidence rather than signal uncertainty. Such verbalized overconfidence can mislead users and weaken confidence scores as a reliable uncertainty signal, yet its internal mechanisms remain poorly understood. We present a circuit-level mechanistic analysis of this inflated verbalized confidence in LLMs, organized around three axes: capturing verbalized confidence as a differentiable internal signal, identifying the circuits that causally inflate it, and leveraging these insights for targeted inference-time recalibration. Across two instruction-tuned LLMs on three datasets, we find that a compact set of MLP blocks and attention heads, concentrated in middle-to-late layers, consistently writes the confidence-inflation signal at the…
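One plausible way to read "capturing verbalized confidence as a differentiable internal signal" is to take the model's logits over a fixed set of candidate confidence tokens and compute the softmax-weighted expected confidence. The sketch below assumes that reading; the truncated abstract does not confirm the authors' exact definition, and the function name, the candidate levels, and the example logits are hypothetical.

```python
import math


def expected_confidence(logits, levels):
    """Softmax-weighted mean of candidate confidence levels.

    logits: one raw score per candidate confidence token (hypothetical input).
    levels: the numeric confidence each token stands for, e.g. 0, 50, 100.
    The result is a smooth function of the logits, so in an autodiff
    framework it could be differentiated with respect to internal activations.
    NOTE: an assumed formulation, not the paper's stated method.
    """
    if len(logits) != len(levels) or not logits:
        raise ValueError("logits and levels must be non-empty and aligned")
    shift = max(logits)  # subtract the max for numerical stability
    weights = [math.exp(l - shift) for l in logits]
    total = sum(weights)
    return sum(w / total * lv for w, lv in zip(weights, levels))


# Usage: probability mass piled on the "100" token yields a high score.
score = expected_confidence([0.0, 0.0, 20.0], [0, 50, 100])
```

Unlike parsing the sampled confidence string, a scalar of this kind changes continuously when a component's output is patched or ablated, which is what makes it usable for attributing the signal to individual heads and MLP blocks.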
