How does Chain of Thought Think? Mechanistic Interpretability of Chain-of-Thought Reasoning with Sparse Autoencoding

Xi Chen; Aske Plaat; Niki van Stein

arXiv:2507.22928 · cs.CL · August 1, 2025

TL;DR

This study uses feature-level causal analysis to investigate how Chain-of-Thought (CoT) prompting shapes the internal computation of large language models. It finds that CoT makes internal representations more interpretable and more modular in the higher-capacity model studied.

Contribution

The paper introduces a feature-level causal framework that combines sparse autoencoders with activation patching to analyze CoT faithfulness, and it demonstrates scale-dependent effects on model interpretability and reasoning.

Findings

01. Patching CoT features into a noCoT run significantly raises answer log-probabilities in the 2.8B model.

02. With CoT, the larger model shows higher activation sparsity and higher feature interpretability scores.

03. Useful CoT information is distributed widely across features rather than concentrated in the top-ranked patches.
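
The activation sparsity mentioned in the findings can be made concrete with a small sketch. It assumes sparsity is measured as the fraction of near-zero sparse-autoencoder feature activations; the paper's exact metric is not stated on this page, and the function name, threshold, and toy arrays below are illustrative only, not results from the paper.

```python
import numpy as np

def activation_sparsity(features: np.ndarray, eps: float = 1e-6) -> float:
    """Fraction of feature activations whose magnitude is at most eps.

    `features` has shape (n_tokens, n_features). This definition is an
    assumption made for illustration, not the paper's stated metric.
    """
    return float(np.mean(np.abs(features) <= eps))

# Made-up toy activations (hypothetical, NOT measurements from the paper).
toy_cot = np.array([[0.0, 0.0, 2.1, 0.0],
                    [0.0, 1.3, 0.0, 0.0]])
toy_nocot = np.array([[0.4, 0.0, 0.9, 0.2],
                      [0.1, 0.7, 0.0, 0.3]])

sparsity_cot = activation_sparsity(toy_cot)      # 6 of 8 entries are zero
sparsity_nocot = activation_sparsity(toy_nocot)  # 2 of 8 entries are zero
```

A higher value means fewer features fire per token, which is the sense in which sparser activations are read as more modular computation.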

Abstract

Chain-of-thought (CoT) prompting boosts the accuracy of large language models on multi-step tasks, yet whether the generated "thoughts" reflect the true internal reasoning process remains unresolved. We present the first feature-level causal study of CoT faithfulness. Combining sparse autoencoders with activation patching, we extract monosemantic features from Pythia-70M and Pythia-2.8B while they tackle GSM8K math problems under CoT and plain (noCoT) prompting. Swapping a small set of CoT-reasoning features into a noCoT run raises answer log-probabilities significantly in the 2.8B model, but has no reliable effect in 70M, revealing a clear scale threshold. CoT also leads to significantly higher activation sparsity and feature interpretability scores in the larger model, signalling more modular internal computation. For example, the model's confidence in generating correct answers improves from 1.2…
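
The feature-swapping step described in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the sparse autoencoder weights are random and untrained, features are chosen by the largest absolute difference between CoT and noCoT activations, and the noCoT reconstruction residual is added back after decoding. All names (`encode`, `decode`, `patch_top_k`) are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_feat = 8, 32  # toy sizes, chosen for illustration only

# Hypothetical SAE parameters (random and untrained, stand-ins for a real SAE).
W_enc = rng.normal(size=(d_model, d_feat))
b_enc = np.zeros(d_feat)
W_dec = rng.normal(size=(d_feat, d_model))
b_dec = np.zeros(d_model)

def encode(x: np.ndarray) -> np.ndarray:
    """Map a model activation to non-negative SAE features (ReLU encoder)."""
    return np.maximum(x @ W_enc + b_enc, 0.0)

def decode(f: np.ndarray) -> np.ndarray:
    """Map SAE features back to the model's activation space."""
    return f @ W_dec + b_dec

def patch_top_k(x_nocot: np.ndarray, x_cot: np.ndarray, k: int):
    """Swap k CoT features into the noCoT activation.

    Assumption: the k features with the largest |CoT - noCoT| difference
    are the ones swapped; the paper's selection rule may differ.
    """
    f_nocot, f_cot = encode(x_nocot), encode(x_cot)
    idx = np.argsort(-np.abs(f_cot - f_nocot))[:k]
    f_patched = f_nocot.copy()
    f_patched[idx] = f_cot[idx]
    # Keep whatever the SAE failed to reconstruct in the noCoT run, so that
    # k = 0 returns the original activation unchanged.
    residual = x_nocot - decode(f_nocot)
    return decode(f_patched) + residual, idx

# Toy activations standing in for one layer's output under each prompt.
x_nocot = rng.normal(size=d_model)
x_cot = rng.normal(size=d_model)
x_patched, swapped = patch_top_k(x_nocot, x_cot, k=4)
```

In a full experiment, `x_patched` would be written back into the noCoT forward pass at the same layer and the answer log-probability compared with the unpatched run; that step depends on the model-hooking library and is omitted here.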

Taxonomy

Topics: Mental Health Research Topics · Explainable Artificial Intelligence (XAI) · Functional Brain Connectivity Studies