Towards Principled Evaluations of Sparse Autoencoders for Interpretability and Control
Aleksandar Makelov, George Lange, Neel Nanda

TL;DR
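This paper introduces a framework for evaluating sparse autoencoders (SAEs) on specific tasks by comparing their unsupervised features against supervised feature dictionaries. Applied to the IOI task, the comparison shows where SAE features fall short of the supervised ones and surfaces two training phenomena, feature occlusion and over-splitting.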
Contribution
It proposes a novel evaluation framework for sparse autoencoders using supervised dictionaries as ground-truth references, enabling more objective interpretability assessments.
Findings
SAEs capture interpretable features for the IOI task
SAEs are less effective than supervised features at controlling the model
Two phenomena identified in SAE training: feature occlusion and over-splitting
Abstract
Disentangling model activations into meaningful features is a central problem in interpretability. However, the absence of ground-truth for these features in realistic scenarios makes validating recent approaches, such as sparse dictionary learning, elusive. To address this challenge, we propose a framework for evaluating feature dictionaries in the context of specific tasks, by comparing them against supervised feature dictionaries. First, we demonstrate that supervised dictionaries achieve excellent approximation, control, and interpretability of model computations on the task. Second, we use the supervised dictionaries to develop and contextualize evaluations of unsupervised dictionaries along the same three axes. We apply this framework to the indirect object identification (IOI) task using GPT-2 Small, with sparse autoencoders (SAEs) trained on either the IOI or…
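As a rough illustration of what a supervised feature dictionary and an "approximation" check could look like, the sketch below builds one feature vector per attribute value as a conditional mean of activations minus the overall mean, then measures how much activation variance the resulting reconstruction explains. This is an assumption made for illustration, not the authors' implementation: the construction, the function names (`fit_supervised_dictionary`, `reconstruct`, `fraction_variance_explained`), and the synthetic data are all hypothetical, and no real model activations are involved.

```python
import numpy as np

def fit_supervised_dictionary(acts, attrs):
    """acts: (n, d) activations; attrs: (n, k) integer attribute labels.
    Returns the overall mean and, per attribute, a dict value -> feature vector.
    Assumption: each feature is a conditional mean minus the overall mean."""
    mean = acts.mean(axis=0)
    features = []
    for a in range(attrs.shape[1]):
        values = np.unique(attrs[:, a])
        features.append(
            {int(v): acts[attrs[:, a] == v].mean(axis=0) - mean for v in values}
        )
    return mean, features

def reconstruct(mean, features, attrs):
    """Approximate each activation as mean + sum of its attribute features."""
    recon = np.tile(mean, (attrs.shape[0], 1))
    for a, table in enumerate(features):
        recon += np.stack([table[int(v)] for v in attrs[:, a]])
    return recon

def fraction_variance_explained(acts, recon):
    """1 - (residual sum of squares / total sum of squares around the mean)."""
    resid = ((acts - recon) ** 2).sum()
    total = ((acts - acts.mean(axis=0)) ** 2).sum()
    return 1.0 - resid / total

# Synthetic stand-in for task activations: two independent attributes
# (hypothetical, e.g. "which name" and "which position") plus small noise.
rng = np.random.default_rng(0)
n, d = 2000, 16
attrs = np.stack([rng.integers(0, 5, n), rng.integers(0, 3, n)], axis=1)
true_a = rng.normal(size=(5, d))
true_b = rng.normal(size=(3, d))
acts = true_a[attrs[:, 0]] + true_b[attrs[:, 1]] + 0.1 * rng.normal(size=(n, d))

mean, features = fit_supervised_dictionary(acts, attrs)
recon = reconstruct(mean, features, attrs)
fve = fraction_variance_explained(acts, recon)
print(f"fraction of variance explained: {fve:.3f}")
```

Under the same assumptions, an unsupervised dictionary such as an SAE could be scored with the same `fraction_variance_explained` function on its own reconstructions, which is the sense in which the supervised dictionary acts as a reference point.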
Taxonomy
Topics: Speech Recognition and Synthesis
Methods: Linear Layer · Discriminative Fine-Tuning · Multi-Head Attention · Dense Connections · Attention Dropout · Weight Decay · Cosine Annealing · Dropout · Linear Warmup With Cosine Annealing
