Are Sparse Autoencoders Useful? A Case Study in Sparse Probing
Subhash Kantamneni, Joshua Engels, Senthooran Rajamanoharan, Max Tegmark, and Neel Nanda

TL;DR
This study evaluates the usefulness of sparse autoencoders (SAEs) in interpreting large language models by testing their impact on downstream tasks under challenging conditions, revealing limited advantages over simpler baselines.
Contribution
The paper provides a comprehensive empirical assessment of SAEs in real-world probing tasks, highlighting their limitations and the importance of rigorous downstream evaluation.
Findings
SAEs occasionally outperform baselines on individual datasets
Ensemble methods with SAEs do not consistently outperform baseline ensembles
Simple non-SAE baselines achieve similar interpretability results
Abstract
Sparse autoencoders (SAEs) are a popular method for interpreting concepts represented in large language model (LLM) activations. However, there is a lack of evidence regarding the validity of their interpretations due to the lack of a ground truth for the concepts used by an LLM, and a growing number of works have presented problems with current SAEs. One alternative source of evidence would be demonstrating that SAEs improve performance on downstream tasks beyond existing baselines. We test this by applying SAEs to the real-world task of LLM activation probing in four regimes: data scarcity, class imbalance, label noise, and covariate shift. Due to the difficulty of detecting concepts in these challenging settings, we hypothesize that SAEs' basis of interpretable, concept-level latents should provide a useful inductive bias. However, although SAEs occasionally perform better than…
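The comparison the abstract describes can be sketched on synthetic data. This is a minimal illustration, not the authors' code: the "activations" are random Gaussians with one planted concept direction, `W_enc` is a hypothetical random ReLU projection standing in for a trained SAE encoder, and selecting the top-k latents by class-mean difference is an assumed selection rule. The helpers build a scarce, imbalanced, label-noised training set and fit the same logistic-regression probe on raw activations and on the selected latents.

```python
import numpy as np

def flip_labels(y, frac, rng):
    """Label noise: flip a random fraction of binary labels."""
    y = y.copy()
    idx = rng.choice(len(y), size=int(round(frac * len(y))), replace=False)
    y[idx] = 1 - y[idx]
    return y

def imbalanced_indices(y, n, pos_frac, rng):
    """Data scarcity + class imbalance: pick n indices with a set positive fraction."""
    n_pos = int(round(n * pos_frac))
    pos = rng.choice(np.flatnonzero(y == 1), size=n_pos, replace=False)
    neg = rng.choice(np.flatnonzero(y == 0), size=n - n_pos, replace=False)
    return rng.permutation(np.concatenate([pos, neg]))

def top_k_by_mean_diff(Z, y, k):
    """Assumed selection rule: keep the k features whose class means differ most."""
    diff = np.abs(Z[y == 1].mean(axis=0) - Z[y == 0].mean(axis=0))
    return np.argsort(diff)[-k:]

def fit_logreg(X, y, lr=0.1, steps=500, l2=1e-2):
    """L2-regularised logistic regression trained by plain gradient descent."""
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))
        g = p - y
        w -= lr * (X.T @ g / len(y) + l2 * w)
        b -= lr * g.mean()
    return w, b

def accuracy(X, y, w, b):
    return float((((X @ w + b) > 0).astype(int) == y).mean())

rng = np.random.default_rng(0)
n, d, m, k = 2000, 32, 128, 8

# Synthetic "activations": Gaussian noise plus one planted concept direction.
y = rng.integers(0, 2, size=n)
direction = rng.normal(size=d)
direction /= np.linalg.norm(direction)
X = rng.normal(size=(n, d)) + 2.0 * np.outer(2 * y - 1, direction)

# Hypothetical stand-in for a trained SAE encoder (random ReLU features).
W_enc = rng.normal(size=(d, m)) / np.sqrt(d)
Z = np.maximum(X @ W_enc, 0.0)

# First half is the training pool, second half is a clean held-out test set.
pool, test = np.arange(n // 2), np.arange(n // 2, n)
tr = pool[imbalanced_indices(y[pool], n=100, pos_frac=0.2, rng=rng)]
y_tr = flip_labels(y[tr], frac=0.1, rng=rng)

# Baseline probe on raw activations.
w_raw, b_raw = fit_logreg(X[tr], y_tr)
acc_raw = accuracy(X[test], y[test], w_raw, b_raw)

# Sparse probe on the k selected latents.
sel = top_k_by_mean_diff(Z[tr], y_tr, k)
w_sae, b_sae = fit_logreg(Z[tr][:, sel], y_tr)
acc_sae = accuracy(Z[test][:, sel], y[test], w_sae, b_sae)

print(f"raw-activation probe: {acc_raw:.3f}  top-{k} latent probe: {acc_sae:.3f}")
```

Covariate shift is not simulated here; one way to approximate it under the same assumptions would be to draw the test activations from a shifted distribution, for example by adding a constant offset to `X[test]` before encoding.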
Taxonomy
Topics: Machine Learning and Data Classification · Advanced Neural Network Applications · Digital Media Forensic Detection
