Sparse Autoencoders Do Not Find Canonical Units of Analysis

Patrick Leask; Bart Bussmann; Michael Pearce; Joseph Bloom; Curt Tigges; Noura Al Moubayed; Lee Sharkey; Neel Nanda

arXiv:2502.04878·cs.LG·February 10, 2025

Open Access · 3 Reviews

TL;DR

This paper challenges the idea that sparse autoencoders discover canonical, atomic features in neural networks, showing they are incomplete and their units are decomposable into smaller, interpretable components.

Contribution

The authors introduce novel techniques, SAE stitching and meta-SAEs, to demonstrate that sparse autoencoders do not find atomic units and are inherently incomplete.

Findings

01. SAE stitching reveals the incompleteness of smaller SAEs

02. Meta-SAEs show that SAE latents decompose into smaller, interpretable units

03. SAE latents often combine multiple concepts rather than representing a single atomic one
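
The decomposition finding can be illustrated with a toy sketch in Python. It assumes a meta-SAE treats each decoder direction of a base SAE as a data point and reconstructs it from a few meta-latents. Everything in the block is hypothetical: the names `topk_encode` and `variance_explained`, the fixed identity dictionary (a real meta-SAE would learn its dictionary by training), and the synthetic base decoder. It is not the authors' implementation.

```python
import numpy as np

def topk_encode(X, D, k):
    """Project rows of X onto dictionary rows D, keep the k largest activations per row."""
    acts = np.maximum(X @ D.T, 0.0)
    # Indices of every activation except the k largest in each row.
    drop = np.argsort(acts, axis=1)[:, : acts.shape[1] - k]
    np.put_along_axis(acts, drop, 0.0, axis=1)
    return acts

def variance_explained(X, X_hat):
    """Fraction of the variance of X that the reconstruction X_hat accounts for."""
    resid = np.sum((X - X_hat) ** 2)
    total = np.sum((X - X.mean(axis=0)) ** 2)
    return 1.0 - resid / total

# Toy setup (assumption): 8 orthonormal "meta" directions, and 20 base-SAE
# decoder directions that are each a positive mix of exactly two of them.
rng = np.random.default_rng(0)
meta_dict = np.eye(8)
base_decoder = np.zeros((20, 8))
for row in base_decoder:
    atoms = rng.choice(8, size=2, replace=False)
    row[atoms] = rng.uniform(0.5, 1.0, size=2)

acts = topk_encode(base_decoder, meta_dict, k=2)
recon = acts @ meta_dict
score = variance_explained(base_decoder, recon)
```

With `k=2` each toy direction is rebuilt from its two meta-latents. Lowering `k` to 1 drops part of every direction, so `variance_explained` falls below 1; a score of this kind is one way to report how well a meta-SAE reconstructs the base latents.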

Abstract

A common goal of mechanistic interpretability is to decompose the activations of neural networks into features: interpretable properties of the input computed by the model. Sparse autoencoders (SAEs) are a popular method for finding these features in LLMs, and it has been postulated that they can be used to find a canonical set of units: a unique and complete list of atomic features. We cast doubt on this belief using two novel techniques: SAE stitching to show they are incomplete, and meta-SAEs to show they are not atomic. SAE stitching involves inserting or swapping latents from a larger SAE into a smaller one. Latents from the larger SAE can be divided into two categories: novel latents, which improve performance when added to the smaller SAE, indicating they capture novel information, and reconstruction latents, which can replace corresponding latents in the…
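
The insertion half of SAE stitching described in the abstract can be sketched on a toy example. This is a minimal illustration under stated assumptions, not the paper's procedure: the SAEs are plain ReLU autoencoders stored as dicts, the stitched SAE keeps the smaller SAE's decoder bias, and a latent is labelled "novel" simply when inserting it lowers mean squared error. The helper names (`reconstruct`, `insert_latent`, `label_latent`) are hypothetical.

```python
import numpy as np

def reconstruct(x, sae):
    """ReLU SAE forward pass: encode, then decode."""
    acts = np.maximum(x @ sae["W_enc"] + sae["b_enc"], 0.0)
    return acts @ sae["W_dec"] + sae["b_dec"]

def mse(x, x_hat):
    return float(np.mean((x - x_hat) ** 2))

def insert_latent(small, large, j):
    """Copy of `small` with latent j of `large` appended (assumption: keep small's b_dec)."""
    return {
        "W_enc": np.concatenate([small["W_enc"], large["W_enc"][:, j:j + 1]], axis=1),
        "b_enc": np.concatenate([small["b_enc"], large["b_enc"][j:j + 1]]),
        "W_dec": np.concatenate([small["W_dec"], large["W_dec"][j:j + 1, :]], axis=0),
        "b_dec": small["b_dec"],
    }

def label_latent(x, small, large, j):
    """Label latent j 'novel' if inserting it lowers MSE, else 'reconstruction'."""
    base = mse(x, reconstruct(x, small))
    stitched = mse(x, reconstruct(x, insert_latent(small, large, j)))
    return "novel" if stitched < base else "reconstruction"

# Toy data (assumption): 2-D activations with positive components on both axes.
rng = np.random.default_rng(0)
x = rng.uniform(0.1, 1.0, size=(100, 2))

# The small SAE only represents axis 0; the large SAE represents both axes.
small = {"W_enc": np.array([[1.0], [0.0]]), "b_enc": np.zeros(1),
         "W_dec": np.array([[1.0, 0.0]]), "b_dec": np.zeros(2)}
large = {"W_enc": np.eye(2), "b_enc": np.zeros(2),
         "W_dec": np.eye(2), "b_dec": np.zeros(2)}

labels = [label_latent(x, small, large, j) for j in range(2)]
```

In this toy, latent 1 of the large SAE covers a direction the small SAE misses, so inserting it lowers the error and it is labelled novel. Latent 0 duplicates a latent the small SAE already has, so inserting it double-counts that direction and raises the error; such a latent would have to replace its counterpart rather than be added.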

Peer Reviews

Decision·ICLR 2025 Poster

Reviewer 01 · Rating 8 · Confidence 4

Strengths

Overall, the paper addresses an important question about SAE features with reasonable experiments and presentation. This issue, of whether SAE features are "canonical" or "atomic", could have implications for how we scale up SAEs and also for how we use their latents for model interventions, circuit analysis, etc. The presentation in the paper is clear and the experiments are original and well-executed. Some particular strong points:
* I think Figure 5 is quite compelling. It makes a lo

Weaknesses

* I could not find reported values of the reconstruction loss that meta-SAEs obtain in reconstructing the base 49k-latent SAE latents. How precisely do meta-SAEs actually reconstruct the latents? If they are only a very weak approximation, what would that say about the hypothesis that large-SAE latents are linear combinations of more atomic latents?
* Some minor grammatical and presentation issues: "vertexes" -> vertices, the left quotation marks in the meta-SAE section should be fixed, etc.

Reviewer 02 · Rating 6 · Confidence 4

Strengths

The paper attempts to understand how activations deep within a transformer might correspond to higher level natural language concepts. This is an interesting topic. The conclusion is also interesting, namely that sparse autoencoders may not form a complete set of explanations, with larger SAEs potentially containing more fine-grained information.

Weaknesses

The paper is quite presumptive about readers' knowledge of the topic, lacking clear explanations in places. Whilst the paper does a reasonable job of explaining sparse encoders, it doesn't explain how these are used to find actual "inputs" that activate particular features. This isn't explained in the paper and I had to read the cited papers to understand this. There is some description in section 5.1 (page 8) but this is too late for the reader unfamiliar with this area to understand the pape

Reviewer 03 · Rating 8 · Confidence 3

Strengths

- The motivation and methods are explained clearly and intuitively, with helpful examples.
- The authors contextualize their approach by discussing relevant state-of-the-art methods.
- Several experiments are conducted to assess the methods' performance, with comparisons to state-of-the-art baselines and detailed experimental information.
- An interactive dashboard is included to explore the latents learned with meta-SAEs.

Weaknesses

- Including a discussion on the potential limitations of the proposed approaches would be valuable.
- It would also be helpful to have an expanded discussion on assessing the quality of the representations and adapting the dimensionality of the SAE to suit the requirements of different analyses.

Taxonomy

Topics: Generative Adversarial Networks and Image Synthesis · Neural Networks and Applications

Methods: Sparse Evolutionary Training