Automatically Interpreting Millions of Features in Large Language Models
Gonçalo Paulo, Alex Mallen, Caden Juang, Nora Belrose

TL;DR
This paper introduces an automated, LLM-driven pipeline that generates and evaluates natural language explanations for the millions of latent features learned by sparse autoencoders (SAEs) trained on large language model activations.
Contribution
The paper presents a scalable, open-source pipeline for generating and evaluating natural language explanations of SAE features, including five new explanation-scoring techniques that are cheaper to run than the previous state of the art, along with interpretability guidelines.
Findings
SAE features are more interpretable than individual neurons.
Intervention scoring reveals features not identified by existing methods.
SAEs trained on nearby layers show high semantic similarity.
Abstract
While the activations of neurons in deep neural networks usually do not have a simple human-understandable interpretation, sparse autoencoders (SAEs) can be used to transform these activations into a higher-dimensional latent space which may be more easily interpretable. However, these SAEs can have millions of distinct latent features, making it infeasible for humans to manually interpret each one. In this work, we build an open-source automated pipeline to generate and evaluate natural language explanations for SAE features using LLMs. We test our framework on SAEs of varying sizes, activation functions, and losses, trained on two different open-weight LLMs. We introduce five new techniques to score the quality of explanations that are cheaper to run than the previous state of the art. One of these techniques, intervention scoring, evaluates the interpretability of the effects of…
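To make the abstract's core idea concrete, the following is a minimal sketch of how an SAE maps a model activation vector into a sparse, higher-dimensional latent vector and back. It is not the paper's implementation: the function names `sae_encode` and `sae_decode`, the top-k activation, the dimensions, and the random weights are all assumptions chosen for illustration, and a real SAE would use trained weights.

```python
import random


def sae_encode(x, W_enc, b_enc, k):
    """Encode activations x into a sparse, higher-dimensional latent vector.

    Assumption: a top-k activation (one of several possible SAE activation
    functions). At most k latents are nonzero after encoding.
    """
    n_latents = len(b_enc)
    # Linear projection into the (wider) latent space.
    pre = [
        sum(x[i] * W_enc[i][j] for i in range(len(x))) + b_enc[j]
        for j in range(n_latents)
    ]
    # ReLU, then keep only the k largest values and zero out the rest.
    relu = [max(p, 0.0) for p in pre]
    top = set(sorted(range(n_latents), key=lambda j: relu[j], reverse=True)[:k])
    return [relu[j] if j in top else 0.0 for j in range(n_latents)]


def sae_decode(z, W_dec, b_dec):
    """Reconstruct the original activation from the sparse latents."""
    d_model = len(b_dec)
    return [
        sum(z[j] * W_dec[j][i] for j in range(len(z))) + b_dec[i]
        for i in range(d_model)
    ]


# Toy dimensions and random (untrained) weights, for illustration only.
random.seed(0)
d_model, n_latents, k = 16, 64, 4
W_enc = [[random.gauss(0, 1) / d_model ** 0.5 for _ in range(n_latents)]
         for _ in range(d_model)]
W_dec = [[random.gauss(0, 1) / n_latents ** 0.5 for _ in range(d_model)]
         for _ in range(n_latents)]
b_enc = [0.0] * n_latents
b_dec = [0.0] * d_model

x = [random.gauss(0, 1) for _ in range(d_model)]
z = sae_encode(x, W_enc, b_enc, k)
x_hat = sae_decode(z, W_dec, b_dec)
active = [j for j, v in enumerate(z) if v != 0.0]
```

Each index in `active` corresponds to one latent feature. An interpretation pipeline of the kind the abstract describes would collect the text contexts in which a given latent fires and ask an LLM to explain them.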
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling
Methods: Sparse Evolutionary Training
