Automatically Interpreting Millions of Features in Large Language Models

Gonçalo Paulo; Alex Mallen; Caden Juang; Nora Belrose

arXiv:2410.13928 · cs.LG · August 7, 2025 · 2 cites

TL;DR

This paper introduces an automated, LLM-based pipeline for interpreting the millions of latent features that sparse autoencoders (SAEs) extract from the activations of large language models.

Contribution

The paper presents a scalable, open-source pipeline for generating and evaluating natural language explanations of SAE features, along with five new explanation-scoring techniques that are cheaper to run than the previous state of the art, and interpretability guidelines.

Findings

01. SAE features are more interpretable than individual neurons.

02. Intervention scoring reveals features not identified by existing methods.

03. SAEs trained on nearby layers show high semantic similarity.

Abstract

While the activations of neurons in deep neural networks usually do not have a simple human-understandable interpretation, sparse autoencoders (SAEs) can be used to transform these activations into a higher-dimensional latent space which may be more easily interpretable. However, these SAEs can have millions of distinct latent features, making it infeasible for humans to manually interpret each one. In this work, we build an open-source automated pipeline to generate and evaluate natural language explanations for SAE features using LLMs. We test our framework on SAEs of varying sizes, activation functions, and losses, trained on two different open-weight LLMs. We introduce five new techniques to score the quality of explanations that are cheaper to run than the previous state of the art. One of these techniques, intervention scoring, evaluates the interpretability of the effects of…
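As a rough illustration of the SAE transformation the abstract describes, the sketch below maps a small activation vector into a larger, mostly-zero latent vector and decodes it back. It is not the paper's code: the TopK sparsity rule, the dimensions, the random untrained weights, and the names `sae_encode` and `sae_decode` are all assumptions chosen for illustration.

```python
import random

def sae_encode(x, W_enc, b_enc, k):
    """Map activations x (length d_model) to a sparse latent vector (length d_latent).

    Assumed design: ReLU followed by keeping only the k largest latents (TopK).
    """
    pre = [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(W_enc, b_enc)]
    acts = [max(p, 0.0) for p in pre]
    # Indices of the k largest activations; every other latent is zeroed.
    keep = set(sorted(range(len(acts)), key=lambda i: acts[i], reverse=True)[:k])
    return [a if i in keep else 0.0 for i, a in enumerate(acts)]

def sae_decode(z, W_dec, b_dec):
    """Reconstruct the activation vector from the sparse latents."""
    return [sum(w * zi for w, zi in zip(row, z)) + b for row, b in zip(W_dec, b_dec)]

# Hypothetical sizes: the latent space is wider than the activation space.
d_model, d_latent, k = 8, 32, 4
rng = random.Random(0)

# Random, untrained weights; a real SAE learns these by minimizing reconstruction error.
W_enc = [[rng.gauss(0, 1) for _ in range(d_model)] for _ in range(d_latent)]
b_enc = [0.0] * d_latent
W_dec = [[rng.gauss(0, 1) for _ in range(d_latent)] for _ in range(d_model)]
b_dec = [0.0] * d_model

x = [rng.gauss(0, 1) for _ in range(d_model)]
z = sae_encode(x, W_enc, b_enc, k)
x_hat = sae_decode(z, W_dec, b_dec)

active = [i for i, v in enumerate(z) if v != 0.0]
print(f"{len(active)} of {d_latent} latents active")
```

Each nonzero entry of `z` corresponds to one latent feature firing on this input. An explanation pipeline of the kind the abstract describes would collect the inputs on which a given latent fires and ask an LLM to describe what they share.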

Code & Models

Repositories

eleutherai/sae-auto-interp (PyTorch, official)

Models

microsoft/maira-2-sae (Hugging Face model, 8 likes)

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling

Methods: Sparse Evolutionary Training