When Chain-of-Thought Fails, the Solution Hides in the Hidden States
Houman Mehrafarin, Amit Parekh, Ioannis Konstas

TL;DR
This paper uses activation patching to show that chain-of-thought (CoT) tokens in language models carry recoverable, token-level information: transferring their hidden states into a direct-answer run for the same question improves final-answer accuracy.
Contribution
It provides a mechanistic causal analysis of CoT on GSM8K, showing that task-relevant information is encoded in the hidden states of individual CoT tokens, is concentrated in mid-to-late layers, and can be sufficient to recover the correct answer even when the original trace is incorrect.
Findings
Generating after patching CoT hidden states into a direct-answer run yields substantially higher accuracy than both the original CoT trace and direct-answer prompting.
Task-relevant information is concentrated in mid-to-late layers and appears earlier in correct traces.
Shorter patched outputs can outperform full CoT traces in accuracy.
Abstract
Whether intermediate reasoning is computationally useful or merely explanatory depends on whether chain-of-thought (CoT) tokens contain task-relevant information. We present a mechanistic causal analysis of CoT on GSM8K using activation patching: transferring token-level hidden states from a CoT generation to a direct-answer run for the same question, then measuring the effect on final-answer accuracy. Across models, generating after patching yields substantially higher accuracy than both direct-answer prompting and the original CoT trace, revealing that individual CoT tokens can encode sufficient information to recover the correct answer, even when the original trace is incorrect. This task-relevant information is more prevalent in correct than incorrect CoT runs and is unevenly distributed across tokens, concentrating in mid-to-late layers and appearing earlier in the reasoning trace.…
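The patching operation described in the abstract can be illustrated with a toy example. The sketch below is not the authors' code: the model is a hypothetical stack of small random layers written in plain Python, the names (`make_layer`, `apply_layer`, `forward`) are invented for illustration, and the choice of which target position receives the patch is an assumption. It only shows the mechanics: cache hidden states from a source ("CoT") run, overwrite one (layer, position) state in a target ("direct-answer") run, and compare the outputs.

```python
import math
import random

def make_layer(dim, seed):
    """Build one toy layer: a fixed random dim x dim weight matrix (hypothetical model)."""
    rng = random.Random(seed)
    return [[rng.uniform(-0.5, 0.5) for _ in range(dim)] for _ in range(dim)]

def apply_layer(weights, states):
    """Causal mixing (running mean over positions so far) plus a residual tanh update."""
    out = []
    running = [0.0] * len(states[0])
    for pos, h in enumerate(states):
        running = [r + x for r, x in zip(running, h)]
        ctx = [r / (pos + 1) for r in running]
        update = [math.tanh(sum(w * c for w, c in zip(row, ctx))) for row in weights]
        out.append([x + u for x, u in zip(h, update)])
    return out

def forward(layers, embeddings, patch=None):
    """Run the toy model.

    `patch` maps (layer_idx, position) -> replacement hidden state, applied to
    the output of that layer. Returns the final-layer states and a per-layer cache.
    """
    states = [list(h) for h in embeddings]
    cache = []
    for idx, weights in enumerate(layers):
        states = apply_layer(weights, states)
        if patch:
            for (layer_idx, pos), vec in patch.items():
                if layer_idx == idx:
                    states[pos] = list(vec)
        cache.append([list(h) for h in states])
    return states, cache

DIM, N_LAYERS = 4, 3
layers = [make_layer(DIM, seed) for seed in range(N_LAYERS)]

rng = random.Random(42)
# Stand-ins for token embeddings: a longer "CoT" run and a shorter "direct-answer" run.
cot_run = [[rng.uniform(-1, 1) for _ in range(DIM)] for _ in range(6)]
direct_run = [[rng.uniform(-1, 1) for _ in range(DIM)] for _ in range(3)]

# 1. Cache hidden states from the source (CoT) run.
_, cot_cache = forward(layers, cot_run)

# 2. Clean target run, no intervention.
clean_states, clean_cache = forward(layers, direct_run)

# 3. Patched target run: copy one CoT token's state at a mid layer into the
#    last position of the direct-answer run (position choice is an assumption).
layer_idx, src_pos, tgt_pos = 1, 4, len(direct_run) - 1
patch = {(layer_idx, tgt_pos): cot_cache[layer_idx][src_pos]}
patched_states, _ = forward(layers, direct_run, patch=patch)

# Sanity check: patching a run with its own cached state changes nothing.
self_patch = {(layer_idx, tgt_pos): clean_cache[layer_idx][tgt_pos]}
self_states, _ = forward(layers, direct_run, patch=self_patch)
```

In a real experiment the final-position state would feed an output head, and the comparison would be final-answer accuracy over many questions rather than a single vector difference. Sweeping `layer_idx` and `src_pos` is the kind of loop that would localize where in the trace and in the network the useful information sits.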
