Are Sparse Autoencoders Useful? A Case Study in Sparse Probing

Subhash Kantamneni, Joshua Engels, Senthooran Rajamanoharan, Max Tegmark, and Neel Nanda

arXiv:2502.16681·cs.LG·February 25, 2025

Open Access · 1 Repo · 1 Dataset

TL;DR

This study evaluates the usefulness of sparse autoencoders (SAEs) in interpreting large language models by testing their impact on downstream tasks under challenging conditions, revealing limited advantages over simpler baselines.

Contribution

The paper provides a comprehensive empirical assessment of SAEs in real-world probing tasks, highlighting their limitations and the importance of rigorous downstream evaluation.

Findings

01. SAEs occasionally outperform baselines on individual datasets

02. Ensemble methods with SAEs do not consistently outperform baseline ensembles

03. Simple non-SAE baselines achieve similar interpretability results

Abstract

Sparse autoencoders (SAEs) are a popular method for interpreting concepts represented in large language model (LLM) activations. However, there is a lack of evidence regarding the validity of their interpretations due to the lack of a ground truth for the concepts used by an LLM, and a growing number of works have presented problems with current SAEs. One alternative source of evidence would be demonstrating that SAEs improve performance on downstream tasks beyond existing baselines. We test this by applying SAEs to the real-world task of LLM activation probing in four regimes: data scarcity, class imbalance, label noise, and covariate shift. Due to the difficulty of detecting concepts in these challenging settings, we hypothesize that SAEs' basis of interpretable, concept-level latents should provide a useful inductive bias. However, although SAEs occasionally perform better than…
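To make the probing setup in the abstract concrete, here is a minimal sketch of a label-noise comparison between a dense probe on raw activations and a k-sparse probe on SAE-style latents. Everything in it is an assumption made for illustration, not the authors' pipeline: the activations are synthetic, the "encoder" is a random untrained ReLU projection standing in for a real SAE, latents are ranked by a simple class-mean-difference heuristic, and the names `select_top_k`, `fit_logistic`, `flip_labels`, and `accuracy` are hypothetical helpers.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic "activations": Gaussian noise plus a concept direction for class 1.
n, d, m, k = 400, 32, 128, 8
direction = rng.normal(size=d)
direction /= np.linalg.norm(direction)
labels = rng.integers(0, 2, size=n)
acts = rng.normal(size=(n, d)) + 2.0 * labels[:, None] * direction

# Stand-in for an SAE encoder (random and untrained, purely illustrative).
W_enc = rng.normal(size=(d, m)) / np.sqrt(d)
latents = np.maximum(acts @ W_enc, 0.0)  # ReLU keeps latents non-negative


def select_top_k(feats, y, k):
    """Return indices of the k features with the largest class-mean gap."""
    gap = feats[y == 1].mean(axis=0) - feats[y == 0].mean(axis=0)
    return np.argsort(-np.abs(gap))[:k]


def fit_logistic(X, y, lr=0.1, steps=500, l2=1e-3):
    """Logistic regression by plain gradient descent; returns (w, b)."""
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))
        w -= lr * (X.T @ (p - y) / len(y) + l2 * w)
        b -= lr * np.mean(p - y)
    return w, b


def accuracy(X, y, w, b):
    """Fraction of examples whose predicted class matches y."""
    return float(np.mean(((X @ w + b) > 0) == (y == 1)))


def flip_labels(y, rate, rng):
    """Flip each binary label independently with probability `rate`."""
    flip = rng.random(len(y)) < rate
    return np.where(flip, 1 - y, y)


# Label-noise regime: corrupt training labels, evaluate on clean test labels.
train, test = slice(0, 300), slice(300, n)
noisy = flip_labels(labels[train], rate=0.2, rng=rng)

# Baseline: dense probe on raw activations.
w, b = fit_logistic(acts[train], noisy)
acc_dense = accuracy(acts[test], labels[test], w, b)

# Sparse probe: pick k latents on the (noisy) training set, then fit on them.
top = select_top_k(latents[train], noisy, k)
w, b = fit_logistic(latents[train][:, top], noisy)
acc_sparse = accuracy(latents[test][:, top], labels[test], w, b)

print(f"dense probe: {acc_dense:.3f}   k-sparse latent probe: {acc_sparse:.3f}")
```

The other three regimes could be sketched the same way by changing only the training split: subsample it for data scarcity, drop most examples of one class for class imbalance, or draw the test activations from a shifted distribution for covariate shift. The numbers this toy prints say nothing about real SAEs; it only shows the shape of the comparison.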

Code & Models

Repositories

joshengels/sae-probes
PyTorch · Official

Datasets

serteal/sparse-probing
dataset · 248 dl

Taxonomy

Topics: Machine Learning and Data Classification · Advanced Neural Network Applications · Digital Media Forensic Detection