Decomposing The Dark Matter of Sparse Autoencoders
Joshua Engels, Logan Riggs, Max Tegmark

TL;DR
This paper investigates the unexplained variance ("dark matter") left by sparse autoencoders for language models. It finds that much of this error is linearly predictable from the input activations, and that the remaining nonlinear error contains fewer learned features and affects downstream performance.
Contribution
It demonstrates that a significant portion of SAE dark matter is linearly predictable from initial activations and introduces methods to analyze and reduce nonlinear error components.
Findings
About half of SAE error can be linearly predicted from initial activations.
Larger SAEs struggle with the same contexts as smaller ones, showing predictable scaling.
Nonlinear error contains fewer learned features and impacts downstream performance.
Abstract
Sparse autoencoders (SAEs) are a promising technique for decomposing language model activations into interpretable linear features. However, current SAEs fall short of completely explaining model performance, resulting in "dark matter": unexplained variance in activations. This work investigates dark matter as an object of study in its own right. Surprisingly, we find that much of SAE dark matter -- about half of the error vector itself and >90% of its norm -- can be linearly predicted from the initial activation vector. Additionally, we find that the scaling behavior of SAE error norms at a per token level is remarkably predictable: larger SAEs mostly struggle to reconstruct the same contexts as smaller SAEs. We build on the linear representation hypothesis to propose models of activations that might lead to these observations. These insights imply that the part of the SAE error vector…
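The linear-predictability claim in the abstract can be illustrated with a small NumPy sketch. This is not the authors' code or setup: `toy_sae` is a hypothetical, untrained top-k autoencoder with random tied weights, the activations are synthetic Gaussian noise, and the helper names are assumptions made for illustration. The numbers it prints say nothing about the paper's results. It only shows the shape of the measurement: fit an affine map from the activation `x` to the SAE error vector (and separately to its norm), then report the fraction of variance explained on held-out rows.

```python
import numpy as np

def fit_affine(inputs, targets):
    """Least-squares affine map (weights plus bias row) from inputs to targets."""
    design = np.hstack([inputs, np.ones((inputs.shape[0], 1))])
    coef, *_ = np.linalg.lstsq(design, targets, rcond=None)
    return coef

def predict_affine(coef, inputs):
    """Apply the affine map returned by fit_affine."""
    design = np.hstack([inputs, np.ones((inputs.shape[0], 1))])
    return design @ coef

def fraction_variance_explained(targets, preds):
    """1 - (residual sum of squares) / (total sum of squares about the mean)."""
    residual = np.sum((targets - preds) ** 2)
    total = np.sum((targets - targets.mean(axis=0)) ** 2)
    return 1.0 - residual / total

def toy_sae(x, w_enc, w_dec, k):
    """Hypothetical stand-in for an SAE: ReLU encoder, keep top-k latents, linear decoder."""
    pre = np.maximum(x @ w_enc, 0.0)
    kth = np.partition(pre, -k, axis=1)[:, -k][:, None]  # k-th largest value per row
    acts = np.where(pre >= kth, pre, 0.0)
    return acts @ w_dec

# Synthetic "activations" and random, untrained SAE weights (assumptions, not real data).
rng = np.random.default_rng(0)
x = rng.normal(size=(4000, 16))
w_enc = rng.normal(size=(16, 64)) / 4.0
w_dec = w_enc.T.copy()

# SaeError(x) := x - Sae(x), following the definition quoted in the reviews.
sae_error = x - toy_sae(x, w_enc, w_dec, k=8)
error_norm = np.linalg.norm(sae_error, axis=1, keepdims=True)

# Fit on the first half, evaluate on the held-out second half.
train, test = slice(0, 2000), slice(2000, 4000)
vec_coef = fit_affine(x[train], sae_error[train])
norm_coef = fit_affine(x[train], error_norm[train])
vec_fve = fraction_variance_explained(sae_error[test], predict_affine(vec_coef, x[test]))
norm_fve = fraction_variance_explained(error_norm[test], predict_affine(norm_coef, x[test]))
print(f"error vector FVE: {vec_fve:.3f}, error norm FVE: {norm_fve:.3f}")
```

With real data, `x` would be model activations at the layer of interest and `toy_sae` would be replaced by a trained SAE's reconstruction; the held-out split guards against the affine probe overfitting.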
Peer Reviews
Decision: ICLR 2025 Conference Withdrawn Submission
The paper’s analysis is very thorough and a very compelling decomposition of the scaling of sparse autoencoders for feature extraction. The observations are very well described, and the theory and methodology are sound. I think it would be an impactful work; however, it does have one fatal shortcoming (see below).
Weaknesses
The main weakness is that the method is only applied to one model, and even worse, to only one layer in that model. The methodology is self-described as empirical, but because only one context is analyzed, the findings are not sufficiently proven: the paper essentially shows only one data point. That is, is it possible that this only happens in this one layer for this one model? I would not think that the findings are exclusive to just this one context, but would it be possible to
* I think that the two biggest problems with Sparse Autoencoders and Mechanistic Interpretability research are i) faithfulness of decomposition of models and ii) real-world application of insights. This paper is solid progress on i) because they ask why SAEs are currently limited.
* The paper describes a sensible theoretical decomposition of model activations into SAE learned features, dense features and non-linear features, and also measures these terms in practice.
* The paper does a wide ran
1. **Large sections of the paper are difficult to understand**
   a. Some notation is defined in strange ways that required me to reread it many times. E.g. `SaeError(x) := x - Sae(x)` is defined with a -1 coefficient for `Sae(x)`, but `NonlinearError(x) := Sae(x) - Wx - \sum_{i=0}^{m} w_i \vec{y}_i` is defined with a +1 coefficient for `Sae(x)`. I *think* that most of these problems are downstream of defining some parts of the notation with an end state in mind, e.g. the weak linear representat
1. This paper focuses on a commonly ignored detail, SAE error, to provide more insight into a research subject related to the representation of language.
2. It examines SAE error from multiple angles, such as scaling laws, a norm prediction test, etc.
3. It investigates ways to reduce NonlinearError in detail.
# Major
To be honest, I had a very hard time understanding this paper. Specifically,
1. How do you define the term “most common” (features..) on line 146? It is ambiguous.
2. Section 4.1 is not well written, which makes it hard for me to understand the later sections of the paper. Specifically:
   1. Lines 193 and 194 do not make sense. For instance, Dense is nonlinear, so how can the sum of Wx and Dense be the linear component of the error?
   2. What leads you to make the statement on line 200 to 2
- The core idea to decompose the SAE reconstruction error into various parts is very interesting and, as far as I am aware, novel. Having such a decomposition could potentially be very useful for training better SAEs and informing the debate around the extent to which the linear representation hypothesis holds.
- The authors attempt to measure these quantities using a wide range of experiments. They also study the downstream effects of each component of their decomposition, and investigate method
The main weakness of this paper is that the central quantities are not defined clearly enough for me to be able to properly understand what is going on, or the adequacy of their empirical experiments. My current understanding, rephrased in my own language, is that the authors are making the following decomposition:
- Activation vector $x$
- The sparse autoencoder reconstruction: $\textup{Sae}(x)$
- The error of the SAE reconstruction: $\textup{SaeError}(x)$
- Unlearned sparse feat
Taxonomy
Topics: Dark Matter and Cosmic Phenomena · CCD and CMOS Imaging Sensors · Particle Detector Development and Performance
