TL;DR
This paper investigates whether sparse autoencoders reliably identify reasoning features in language models, revealing that many features are sensitive to token interventions and emphasizing the importance of falsification in attribution.
Contribution
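The paper introduces a falsification-based evaluation framework, combining causal token injection with LLM-guided counterexample construction, to test whether sparse autoencoder features in language models actually track reasoning, and it highlights the limitations of such features.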
Findings
Many contrastively selected features are highly sensitive to token-level interventions: 45%-90% of candidates activate after token injection.
LLM-guided falsification can produce inputs that suppress reasoning trace activation.
Sparse autoencoders often capture low-dimensional correlates rather than true reasoning features.
Abstract
We study how reliably sparse autoencoders (SAEs) support claims about reasoning-related internal features in large language models. We first give a stylized analysis showing that sparsity-regularized decoding can preferentially retain stable low-dimensional correlates while suppressing high-dimensional within-behavior variation, motivating the possibility that contrastively selected "reasoning" features may concentrate on cue-like structure when such cues are coupled with reasoning traces. Building on this perspective, we propose a falsification-based evaluation framework that combines causal token injection with LLM-guided counterexample construction. Across 22 configurations spanning multiple model families, layers, and reasoning datasets, we find that many contrastively selected candidates are highly sensitive to token-level interventions, with 45%-90% activating after injecting only…
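The token-injection check described in the abstract can be illustrated with a toy sketch. This is not the paper's implementation: the feature functions, the cue token, the threshold, and the names `injection_sensitivity` and `sensitive_fraction` are all hypothetical stand-ins chosen for illustration. The idea shown is only that a candidate "reasoning" feature is flagged when it stays quiet on non-reasoning text but fires once a cue token is prepended.

```python
from typing import Callable, Dict, List, Sequence

# Hypothetical stand-in for "activation of one SAE feature on a token sequence".
FeatureFn = Callable[[Sequence[str]], float]


def injection_sensitivity(
    features: Dict[str, FeatureFn],
    baseline_inputs: List[List[str]],
    cue_tokens: List[str],
    threshold: float,
) -> Dict[str, bool]:
    """Flag features that are inactive on baseline text but active after injection."""
    flagged: Dict[str, bool] = {}
    for name, fn in features.items():
        base = [fn(tokens) for tokens in baseline_inputs]
        injected = [fn(cue_tokens + tokens) for tokens in baseline_inputs]
        mean_base = sum(base) / len(base)
        mean_injected = sum(injected) / len(injected)
        # Sensitive: the cue tokens alone move the feature across the threshold.
        flagged[name] = mean_base < threshold <= mean_injected
    return flagged


def sensitive_fraction(flagged: Dict[str, bool]) -> float:
    """Share of candidate features flagged as token-sensitive."""
    return sum(flagged.values()) / len(flagged)


# Toy features (assumptions, not real SAE latents).
def cue_like(tokens: Sequence[str]) -> float:
    # Fires whenever a surface cue word is present.
    return 1.0 if any(t in {"wait", "therefore"} for t in tokens) else 0.0


def length_like(tokens: Sequence[str]) -> float:
    # Grows slowly with sequence length; one extra token barely changes it.
    return min(len(tokens) / 50.0, 1.0)


baseline_inputs = [["the", "cat", "sat"], ["paris", "is", "in", "france"]]
flagged = injection_sensitivity(
    {"cue_like": cue_like, "length_like": length_like},
    baseline_inputs,
    cue_tokens=["wait"],
    threshold=0.5,
)
fraction = sensitive_fraction(flagged)
```

In a real experiment the feature functions would be replaced by SAE activations read from a model layer, and the resulting fraction would play the role of the sensitivity rate reported per configuration.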
Taxonomy
Topics: Topic Modeling · Explainable Artificial Intelligence (XAI) · Multimodal Machine Learning Applications
