When Do LLMs Admit Their Mistakes? Understanding The Role Of Model Belief In Retraction
Yuqing Yang, Robin Jia

TL;DR
This paper investigates when and why large language models admit their mistakes, revealing that model belief strongly influences retraction behavior and that fine-tuning can enhance this ability.
Contribution
The study identifies model belief as a key predictor of retraction and demonstrates causal effects and improvements through steering and fine-tuning.
Findings
Models rarely retract spontaneously despite recognizing mistakes.
Model belief predicts retraction and diverges from factual knowledge.
Fine-tuning improves the accuracy of model retraction.
Abstract
Can large language models (LLMs) admit their mistakes when they should know better? In this work, we study when and why LLMs choose to retract, i.e., spontaneously and immediately acknowledge their errors. Using model-specific testbeds, we find that while LLMs are capable of retraction, they do so only rarely, even when they can recognize their mistakes when asked in a separate interaction. We identify a reliable predictor of retraction: the model's momentary belief, as measured by a probe on its internal states that is trained to predict correctness on external datasets unrelated to retraction. A model retracts only when it "believes" its answers to be incorrect during generation; these beliefs frequently diverge from models' parametric knowledge as measured by factoid questions. Steering experiments further demonstrate that model belief causally drives retraction. In particular, when…
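The probe-and-steer idea in the abstract can be illustrated with a minimal sketch. Everything below is an assumption made for illustration, not the authors' implementation: the "hidden states" are synthetic Gaussian vectors rather than activations from an LLM, the probe is a plain logistic regression fit by gradient descent, and the names `train_linear_probe`, `belief_score`, `steer`, and `make_synthetic_states` are hypothetical. It shows only the general shape of the technique: fit a linear probe that predicts correctness from a hidden state, read its output as a "belief" score, then shift the hidden state along the probe direction and observe that the score moves.

```python
import numpy as np


def train_linear_probe(hidden, labels, lr=0.1, steps=500):
    """Fit a logistic-regression probe (w, b) on hidden states of shape (n, d)."""
    n, d = hidden.shape
    w = np.zeros(d)
    b = 0.0
    for _ in range(steps):
        probs = 1.0 / (1.0 + np.exp(-(hidden @ w + b)))
        grad = probs - labels  # gradient of the mean log-loss w.r.t. the logits
        w -= lr * (hidden.T @ grad) / n
        b -= lr * grad.mean()
    return w, b


def belief_score(hidden, w, b):
    """Probe output in (0, 1): higher means 'the answer is believed correct'."""
    return 1.0 / (1.0 + np.exp(-(hidden @ w + b)))


def steer(hidden, w, alpha):
    """Shift hidden states by alpha along the unit-normalized probe direction."""
    direction = w / np.linalg.norm(w)
    return hidden + alpha * direction


def make_synthetic_states(n=400, d=16, seed=0):
    """Synthetic stand-in for LLM activations (assumption, not real model data)."""
    rng = np.random.default_rng(seed)
    labels = rng.integers(0, 2, size=n).astype(float)
    true_dir = rng.normal(size=d)
    true_dir /= np.linalg.norm(true_dir)
    # Correct (1) and incorrect (0) examples are offset along one hidden direction.
    hidden = rng.normal(size=(n, d)) + 2.0 * np.outer(2.0 * labels - 1.0, true_dir)
    return hidden, labels


hidden, labels = make_synthetic_states()
w, b = train_linear_probe(hidden, labels)
accuracy = ((belief_score(hidden, w, b) > 0.5) == (labels == 1.0)).mean()

# Steering against the direction lowers the belief score; steering along it raises it.
negative = steer(hidden, w, alpha=-8.0)
positive = steer(hidden, w, alpha=8.0)
```

In a real setting the steered activations would be written back into the model's forward pass at a chosen layer, and the question of interest would be whether the generated text then contains a retraction. This sketch stops at the probe score and makes no claim about that downstream behavior.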
Peer Reviews
Decision: ICLR 2026 Conference Withdrawn Submission
1. Regarding originality, momentary belief appears to be an interesting concept. See the questions and weaknesses below, which would help me better judge the novelty.
2. I like the supervised fine-tuning part, which connects to the idea of LLMs' internal belief.
1. Soundness needs to be improved by running experiments on larger models. The largest model studied is 8B, so I am not sure how the findings generalize to larger models. It is necessary to discuss the similarities and differences between larger and smaller models in terms of the momentary belief concept.
2. I am not sure about the utility of momentary belief. What are the final metrics for measuring its utility? The clarity needs to be improved.
3. I am not sure the efficiency of co
- Clear, focused problem: Spontaneous retraction is practically relevant and distinct from multi-turn self-correction.
- Neat empirical finding: Linear probes of hidden states correlate strongly with retraction but less with ground-truth correctness, clarifying what such probes actually capture.
- Causality evidence: Activation steering gives credible leverage that belief directions are not merely correlates but drivers of retraction behavior.
- Mechanistic analysis: Patching experiments sugg
- **External validity / scope** All core experiments use three small instruction models and two knowledge-centric datasets. It’s unclear if belief→retraction generalizes to larger or reasoning-style models, to non-factoid tasks, or to tool-augmented settings.
- **Evaluation dependence on an LLM judge** Retraction is judged automatically (Llama-3.3-70B). Although convenient, relying on a single judge risks systematic bias and false positives/negatives (e.g., hedged text vs true retraction). Huma
1. The paper introduces a new framework for studying LLM reliability, using testbeds for retraction behaviour.
2. This paper offers a warning for SFT. The connection between belief and retraction holds even after supervised fine-tuning, showing that improved belief calibration enhances factual alignment and transparency.
1. For Table 2, I am curious whether the phenomenon is consistent for a larger LLM (e.g., meta-llama/Llama-3.1-70B-Instruct).
2. Regarding the training procedure of the linear probe, I think the training dataset (UTQA dataset) is not a continuation dataset. The linear probe trained on this dataset aims to reflect factual correctness but not the retraction behaviour, right? The interesting observation is that the linear probe is highly predictive of whether the model will retract its answer. Any additional and de
- Positions “retraction” as a measurable, meaningful behavioral metric for model reliability.
- Uses steering and patching methods to demonstrate directional control over behavior.
- Provides concrete evidence that attention value vectors, not just weights, mediate belief propagation.
- Evaluates multiple model families, increasing robustness.
- Links findings to SFT improvements, suggesting paths for aligning model introspection with truthfulness.
- Extensive appendices, code availability,
- Focuses on factual QA; results may differ for open-ended or reasoning-heavy tasks (e.g., math or multi-step reasoning).
- While linear probes capture useful signals, belief may conflate confidence, calibration, and factual recall.
- Although consistent, evaluation could benefit from human verification for robustness.
- Steering effects may differ across architectures or prompt styles.
- Retracting only ~25% of wrong answers initially limits downstream applicability, even if mechanisms are
Taxonomy
Topics: Topic Modeling · Text Readability and Simplification · Artificial Intelligence in Healthcare and Education
Methods: Softmax · Attention Is All You Need
