Hidden Error Awareness in Chain-of-Thought Reasoning: The Signal Is Diagnostic, Not Causal
Aojie Yuan, Zhiyuan Julian Su, Haiyue Zhang, Yi Nian, Yue Zhao

TL;DR
The paper reveals that models can internally detect reasoning errors but do not express this awareness externally, and attempts to use this signal to correct errors are unsuccessful, highlighting a boundary in interpretability.
Contribution
It demonstrates the existence of hidden error awareness in models' internal states and shows this signal cannot be used to fix reasoning errors.
Findings
Internal error detection signals are highly predictive from early reasoning steps.
Verbal confidence does not reflect internal error detection, showing a disconnect.
Interventions to correct errors based on internal signals fail, indicating the signal is diagnostic, not causal.
Abstract
Chain-of-thought (CoT) prompting assumes that generated reasoning reflects a model's internal computation. We show this assumption is wrong in a specific, measurable way: models internally detect their own reasoning errors but outwardly express confidence in them. A linear probe on hidden states predicts trace correctness with 0.95 AUROC -- from the very first reasoning step (0.79) -- while verbalized confidence for wrong traces is 4.55/5, nearly identical to correct ones (4.87/5). A text-surface classifier achieves only 0.59 on the same data, confirming a 0.20-point gap invisible in the generated text. This hidden error awareness holds across three model families (Qwen, Llama, Phi), 1.5B-72B parameters, and RL-trained reasoning models (DeepSeek-R1, 0.852 AUROC). The natural question is whether this signal can fix the errors it detects. It cannot. Four interventions -- activation…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
