Hidden Error Awareness in Chain-of-Thought Reasoning: The Signal Is Diagnostic, Not Causal

Aojie Yuan; Zhiyuan Julian Su; Haiyue Zhang; Yi Nian; Yue Zhao

arXiv:2605.09502 · cs.CL · May 12, 2026

TL;DR

Models internally detect their own reasoning errors but do not express that awareness in their output. Interventions that use the internal signal to correct those errors fail, which marks a limit on what this interpretability signal can do.

Contribution

The paper demonstrates that models' hidden states carry error awareness and shows that this signal cannot be used to fix the reasoning errors it detects.

Findings

01. A linear probe on hidden states predicts trace correctness with 0.95 AUROC, and already reaches 0.79 at the very first reasoning step.

02. Verbalized confidence does not track the internal signal: wrong traces average 4.55/5, nearly identical to correct ones (4.87/5).

03. Interventions that try to correct errors using the internal signal fail, indicating the signal is diagnostic, not causal.

Abstract

Chain-of-thought (CoT) prompting assumes that generated reasoning reflects a model's internal computation. We show this assumption is wrong in a specific, measurable way: models internally detect their own reasoning errors but outwardly express confidence in them. A linear probe on hidden states predicts trace correctness with 0.95 AUROC -- from the very first reasoning step (0.79) -- while verbalized confidence for wrong traces is 4.55/5, nearly identical to correct ones (4.87/5). A text-surface classifier achieves only 0.59 on the same data, confirming a 0.20-point gap invisible in the generated text. This hidden error awareness holds across three model families (Qwen, Llama, Phi), 1.5B-72B parameters, and RL-trained reasoning models (DeepSeek-R1, 0.852 AUROC). The natural question is whether this signal can fix the errors it detects. It cannot. Four interventions -- activation…
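To make the probing setup concrete, the following is a minimal sketch of a linear probe scored by AUROC. It is not the authors' code: the feature vectors are synthetic Gaussian stand-ins for hidden states, and the names `make_synthetic`, `train_linear_probe`, `probe_scores`, and `auroc`, along with the dimension, learning rate, and sample counts, are all assumptions chosen for illustration.

```python
import math
import random


def auroc(scores, labels):
    """Probability that a random positive outscores a random negative (ties count 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def train_linear_probe(X, y, lr=0.1, epochs=200):
    """Logistic regression fit by full-batch gradient descent; returns (weights, bias)."""
    d = len(X[0])
    n = len(X)
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * d, 0.0
        for x, target in zip(X, y):
            z = sum(wi * xi for wi, xi in zip(w, x)) + b
            err = 1.0 / (1.0 + math.exp(-z)) - target  # sigmoid(z) - label
            for j in range(d):
                grad_w[j] += err * x[j]
            grad_b += err
        w = [wi - lr * g / n for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def probe_scores(X, w, b):
    """Linear readout: one scalar score per feature vector."""
    return [sum(wi * xi for wi, xi in zip(w, x)) + b for x in X]


def make_synthetic(n, d, rng):
    """Hypothetical stand-in for hidden states: label 1 (correct trace) is
    shifted along a single direction; every other dimension is pure noise."""
    X, y = [], []
    for _ in range(n):
        label = rng.randint(0, 1)
        x = [rng.gauss(0.0, 1.0) for _ in range(d)]
        x[0] += 2.0 * label
        X.append(x)
        y.append(label)
    return X, y


rng = random.Random(0)
X_train, y_train = make_synthetic(200, 8, rng)
X_test, y_test = make_synthetic(200, 8, rng)

w, b = train_linear_probe(X_train, y_train)
test_auc = auroc(probe_scores(X_test, w, b), y_test)
print(f"held-out probe AUROC: {test_auc:.3f}")
```

In a real replication the synthetic vectors would be replaced by hidden states extracted at a chosen layer and reasoning step, with labels given by whether the full trace reached the correct answer. A high AUROC from such a probe shows only that correctness is linearly decodable from the representation; it says nothing about whether editing that direction changes the model's behavior, which is the distinction the paper draws between diagnostic and causal.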
