Do Sparse Autoencoders Identify Reasoning Features in Language Models?

George Ma; Zhongyuan Liang; Irene Y. Chen; Somayeh Sojoudi

arXiv:2601.05679 · cs.LG · May 19, 2026

TL;DR

This paper investigates whether sparse autoencoders (SAEs) reliably identify reasoning features in language models. It finds that many contrastively selected features respond to token-level interventions, and it argues that attribution claims should be tested by falsification.

Contribution

The paper introduces a falsification-based evaluation framework, combining causal token injection with LLM-guided counterexample construction, for testing whether SAE features selected as reasoning-related actually track reasoning.

Findings

01

Many contrastively selected candidate features are highly sensitive to token-level interventions: across configurations, 45%-90% of them activate after token injection.

02

LLM-guided falsification can produce inputs that suppress reasoning trace activation.

03

Sparse autoencoders often capture low-dimensional correlates rather than true reasoning features.
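The token-injection check behind finding 01 can be illustrated with a toy sketch. Everything here is an assumption made for illustration and is not taken from the paper or its repository: the `Feature` type, the helper names, the threshold of 0.5, and the two toy features stand in for real SAE features read from a model. The idea is to take texts that contain no reasoning, inject a few cue tokens, and count how many candidate features switch on as a result.

```python
from typing import Callable, Sequence

# Hypothetical stand-in for an SAE feature: maps a text to a scalar activation.
Feature = Callable[[str], float]


def inject(text: str, tokens: Sequence[str]) -> str:
    """Prepend cue tokens to a text that otherwise contains no reasoning."""
    return " ".join(tokens) + " " + text


def activates_after_injection(
    feature: Feature,
    baseline_texts: Sequence[str],
    cue_tokens: Sequence[str],
    threshold: float = 0.5,  # assumed activation cutoff
) -> bool:
    """True if the feature is quiet on every baseline text but fires on
    at least one of them once the cue tokens are injected."""
    quiet = all(feature(t) < threshold for t in baseline_texts)
    fires = any(feature(inject(t, cue_tokens)) >= threshold for t in baseline_texts)
    return quiet and fires


def injection_sensitivity(
    features: Sequence[Feature],
    baseline_texts: Sequence[str],
    cue_tokens: Sequence[str],
    threshold: float = 0.5,
) -> float:
    """Fraction of candidate features that activate after token injection."""
    hits = sum(
        activates_after_injection(f, baseline_texts, cue_tokens, threshold)
        for f in features
    )
    return hits / len(features)


# Toy features (hypothetical): one keys on a cue word, one on text length.
def cue_feature(text: str) -> float:
    return 1.0 if "therefore" in text.lower().split() else 0.0


def length_feature(text: str) -> float:
    return min(len(text.split()) / 100.0, 1.0)


baseline = ["The cat sat on the mat.", "Paris is the capital of France."]
rate = injection_sensitivity([cue_feature, length_feature], baseline, ["Therefore"])
print(rate)
```

In this toy setup only the cue-keyed feature flips, so half of the candidates count as injection-sensitive. A feature that fires this way is responding to a surface token and not to any reasoning in the text.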

Abstract

We study how reliably sparse autoencoders (SAEs) support claims about reasoning-related internal features in large language models. We first give a stylized analysis showing that sparsity-regularized decoding can preferentially retain stable low-dimensional correlates while suppressing high-dimensional within-behavior variation, motivating the possibility that contrastively selected "reasoning" features may concentrate on cue-like structure when such cues are coupled with reasoning traces. Building on this perspective, we propose a falsification-based evaluation framework that combines causal token injection with LLM-guided counterexample construction. Across 22 configurations spanning multiple model families, layers, and reasoning datasets, we find that many contrastively selected candidates are highly sensitive to token-level interventions, with 45%-90% activating after injecting only…
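The intuition in the stylized analysis can be illustrated with a small numeric sketch. It is not the paper's derivation: it assumes the sparsity penalty acts like L1 soft-thresholding on coefficients, and the dimension `d`, the energy budget, and the threshold `lam` are arbitrary illustrative values. A cue that concentrates its energy in one coefficient survives the threshold, while variation with the same total energy spread thinly over many coefficients is zeroed out.

```python
import math


def soft_threshold(z: float, lam: float) -> float:
    """Proximal operator of the L1 penalty: shrink z toward zero by lam."""
    return math.copysign(max(abs(z) - lam, 0.0), z)


d = 100        # assumed number of dimensions carrying within-behavior variation
energy = 1.0   # same total energy for the cue and for the variation
lam = 0.3      # assumed sparsity threshold

# Low-dimensional cue: all energy sits in a single coefficient.
cue = [math.sqrt(energy)]
# High-dimensional variation: the same energy spread evenly over d coefficients.
variation = [math.sqrt(energy / d)] * d

cue_energy = sum(soft_threshold(z, lam) ** 2 for z in cue)
variation_energy = sum(soft_threshold(z, lam) ** 2 for z in variation)

print(f"cue energy retained:       {cue_energy:.2f}")
print(f"variation energy retained: {variation_energy:.2f}")
```

Under these assumptions the cue keeps roughly half of its energy and the spread-out variation keeps none, even though both started with the same total energy.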

Code & Models

Repositories

GeorgeMLP/reasoning-probing (GitHub)

Taxonomy

Topics: Topic Modeling · Explainable Artificial Intelligence (XAI) · Multimodal Machine Learning Applications