TL;DR
This paper investigates why large language models often appear overconfident despite lacking actual competence, revealing a two-system architecture: a geometrically complex, high-dimensional assessment process and a simpler, low-dimensional execution mechanism.
Contribution
It introduces a geometric analysis of LLM internal states, showing the decoupling between confidence and competence, and highlights the limited control of confidence via linear interventions.
Findings
Decodable belief axis generalizes across tasks and models.
Assessment manifold has high linear effective dimensionality.
Execution evolves on a low-dimensional manifold, explaining the confidence gap.
Abstract
Large language models (LLMs) often exhibit a puzzling disconnect between their asserted confidence and actual problem-solving competence. We offer a mechanistic account of this decoupling by analyzing the geometry of internal states across two phases: pre-generative assessment and solution execution. A simple linear probe decodes the internal "solvability belief" of a model, revealing a well-ordered belief axis that generalizes across model families and across math, code, planning, and logic tasks. Yet, the geometries diverge: although belief is linearly decodable, the assessment manifold has high linear effective dimensionality as measured from the principal components, while the subsequent reasoning trace evolves on a much lower-dimensional manifold. This sharp reduction in geometric complexity from thought to action mechanistically explains the confidence-competence gap. Causal…
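The two measurements named in the abstract, a linear probe for a binary "solvability" label and an effective dimensionality computed from principal components, can be sketched on toy data. This is a minimal illustration under stated assumptions, not the authors' method: the probe is a ridge-regularized least-squares classifier, effective dimensionality is taken to be the participation ratio of the PCA eigenvalues (one common definition; the paper's exact measure is not given here), and the "assessment" and "execution" arrays are synthetic stand-ins for hidden states, so the printed numbers say nothing about any real model.

```python
import numpy as np

def participation_ratio(X):
    """Effective dimensionality from the PCA spectrum: (sum lam)^2 / sum(lam^2).

    Assumption: this is one common definition, not necessarily the paper's.
    """
    Xc = X - X.mean(axis=0, keepdims=True)
    # Squared singular values of the centered data are proportional to PCA eigenvalues.
    s = np.linalg.svd(Xc, compute_uv=False)
    lam = s ** 2 / (X.shape[0] - 1)
    return float(lam.sum() ** 2 / (lam ** 2).sum())

def fit_linear_probe(X, y, l2=1e-2):
    """Ridge-regularized least-squares probe on +/-1 targets; returns (w, b)."""
    Xb = np.hstack([X, np.ones((X.shape[0], 1))])  # append a bias column
    t = 2.0 * y - 1.0                              # map {0, 1} labels to {-1, +1}
    A = Xb.T @ Xb + l2 * np.eye(Xb.shape[1])
    wb = np.linalg.solve(A, Xb.T @ t)
    return wb[:-1], wb[-1]

def probe_accuracy(X, y, w, b):
    """Fraction of examples whose sign of the probe score matches the label."""
    return float(((X @ w + b > 0).astype(int) == y).mean())

# Synthetic data only (hypothetical stand-ins for hidden states).
rng = np.random.default_rng(0)
n, d = 400, 64
belief_dir = rng.normal(size=d)
belief_dir /= np.linalg.norm(belief_dir)
y = rng.integers(0, 2, size=n)

# "Assessment" states: isotropic noise in all d dimensions plus a label-dependent
# shift along a single direction, so the label is linearly decodable.
assessment = rng.normal(size=(n, d)) + 2.0 * np.outer(2 * y - 1, belief_dir)

# "Execution" states: confined to a 3-dimensional subspace plus small noise.
basis = rng.normal(size=(3, d))
execution = rng.normal(size=(n, 3)) @ basis + 0.05 * rng.normal(size=(n, d))

# Fit on the first 300 examples, evaluate on the held-out 100.
w, b = fit_linear_probe(assessment[:300], y[:300])
acc = probe_accuracy(assessment[300:], y[300:], w, b)
pr_assess = participation_ratio(assessment)
pr_exec = participation_ratio(execution)

print(f"held-out probe accuracy: {acc:.2f}")
print(f"participation ratio (assessment): {pr_assess:.1f}")
print(f"participation ratio (execution):  {pr_exec:.1f}")
```

The toy data is built so that a label can be linearly decodable from states that are nonetheless spread across many principal components, which is the combination the abstract describes for the assessment phase. With real models, the arrays would be replaced by hidden states collected before generation and along the reasoning trace.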
Peer Reviews
Decision: ICLR 2026 Conference Withdrawn Submission
- The paper tackles a timely and important research topic, presenting a thorough and well-structured investigation. The proposed hypothesis provides a reasonable explanation for the observed results.
- The writing has a clear narrative flow, and the authors make a commendable effort to build their central claim through a series of experiments. This storytelling approach helps readers follow the logical progression of the study and understand how each experiment contributes to the overall argume
- From Figure 1, I’m not fully convinced that there is a clear “climbing through” pattern across the model’s layers. The results appear rather noisy to me. Could the authors provide additional evidence or analysis to support this claim?
- Many of the subsequent arguments rely on the initial claim in Section 3.2 — that linear probing accurately reflects an LLM’s internal beliefs. However, an accuracy of 70–75% seems insufficient to make this claim convincing. Is there a systematic way to validat
1) The paper tackles an important question for AI safety and deployment. Understanding what drives the confidence-competence gap has consequences for safety (models that are confidently wrong could be deployed in more dangerous ways than those that appropriately communicate their own uncertainty), and this has obvious ramifications for model trust and deployment. Most of the work (that I know of) thinking about this question focuses on model outputs, and it is valuable whether we can understand s
1) The paper argues that it studies the model's internal "solvability belief." But the labels used to train the probes are the model's own zero-shot performance (Section 3.1). This label is misaligned: the probe is actually being trained to predict whether the model succeeds, not whether it believes it will succeed. As an example, the probe could therefore be learning things about the problem difficulty, or heuristic markers of solvability that emerge early in the network. I do not see how the
- The overall structure of the paper is clear and logically coherent. The authors introduce their research question in a focused way and progressively build their arguments.
- Methodologically, the study follows the established paradigm of mechanistic interpretability, employing linear probes to decompose representations into interpretable components. By linearly separating correct and incorrect belief states with non-trivial reasoning, and further visualizing their non-overlapping distributions
- Figure 3 lacks sufficient clarity. The first step involves constructing linear probes for each layer and each problem; however, it is unclear which layer’s hidden states were used to generate the visualization in this figure. Since four different models were tested, it would be informative to specify whether the trends were consistent across models, and which specific model’s results are displayed in Figure 3. Moreover, if the figure combines hidden states from all layers across over 800
1. The paper is very well written. The authors provide clear and good concept definitions and motivations. The research question is well defined, and the overall presentation makes the paper easy and enjoyable to read.
2. The research problem itself is intriguing and of broad relevance. The work offers insights that can benefit not only researchers studying reasoning capabilities in large language models but also those in AI safety, mechanistic interpretability, and efficiency. The conclusions
W1. Missing Appendix. Line 377 mentions the reverse causal intervention study (solved -> unsolved) in Appendix C, but the Appendix is not attached to the main paper.
W2. Lack of in-response confidence modelling. When humans solve challenging problems, we often begin uncertain about a problem’s solvability and then gradually build confidence or unsolvability-awareness through iterative attempts and self-correction. I believe this dynamic evolution of confidence supports the vision of tes
