Sparse but Wrong: Incorrect L0 Leads to Incorrect Features in Sparse Autoencoders

David Chanin; Adrià Garriga-Alonso

arXiv:2508.16560·cs.LG·December 8, 2025

Open Access·3 Reviews

TL;DR

This paper demonstrates that incorrect setting of the L0 hyperparameter in sparse autoencoders leads to poor feature disentanglement in language models, and proposes a proxy metric to identify the optimal L0.

Contribution

The study reveals the importance of correctly setting L0 in SAEs and introduces a proxy metric to guide this choice, improving feature interpretability in language models.

Findings

01. Incorrect L0 causes feature mixing and degenerate solutions.

02. Proposed proxy metric effectively finds the correct L0.

03. Most existing SAEs use an L0 that is too low.

Abstract

Sparse Autoencoders (SAEs) extract features from LLM internal activations, meant to correspond to interpretable concepts. A core SAE training hyperparameter is L0: how many SAE features should fire per token on average. Existing work compares SAE algorithms using sparsity-reconstruction tradeoff plots, implying L0 is a free parameter with no single correct value aside from its effect on reconstruction. In this work we study the effect of L0 on SAEs, and show that if L0 is not set correctly, the SAE fails to disentangle the underlying features of the LLM. If L0 is too low, the SAE will mix correlated features to improve reconstruction. If L0 is too high, the SAE finds degenerate solutions that also mix features. Further, we present a proxy metric that can help guide the search for the correct L0 for an SAE on a given training distribution. We show that our method finds the correct L0 in…
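
The two ideas in the abstract, L0 as the average number of active features per token and feature mixing when L0 is too low, can be illustrated with a minimal sketch. Everything here is a hypothetical toy and not the paper's experimental setup: two orthogonal true features in 2D, where the second fires only on tokens where the first fires, and a one-latent "SAE" with a tied, bias-free encoder and a unit-norm decoder direction parameterized by an angle `theta`. The function names `mean_l0` and `single_latent_mse` are made up for this example.

```python
import numpy as np


def mean_l0(latent_acts: np.ndarray) -> float:
    """Average number of nonzero latents per token (rows are tokens)."""
    return float((latent_acts != 0).sum(axis=1).mean())


def single_latent_mse(x: np.ndarray, theta: float) -> float:
    """Mean per-token squared error when each token is reconstructed from a
    single unit-norm decoder direction d = (cos theta, sin theta).

    Assumption: tied encoder and no biases, so the activation is relu(x . d).
    """
    d = np.array([np.cos(theta), np.sin(theta)])
    acts = np.maximum(x @ d, 0.0)
    recon = np.outer(acts, d)
    return float(((x - recon) ** 2).sum(axis=1).mean())


# Hypothetical toy data: true feature f1 = e1 fires on every token,
# true feature f2 = e2 fires on half of them (so f2 is correlated with f1).
true_acts = np.array([[1.0, 0.0], [1.0, 1.0]] * 50)
x = true_acts  # features are the standard basis, so inputs equal activations

true_l0 = mean_l0(true_acts)

# A one-latent SAE has L0 = 1, which is below the true L0 of this data.
mse_correct = single_latent_mse(x, 0.0)  # decoder is exactly f1
mse_mixed = single_latent_mse(x, 0.5 * np.arctan(2.0))  # decoder tilted toward f2

print(f"true L0 = {true_l0}")
print(f"MSE with correct decoder f1: {mse_correct:.3f}")
print(f"MSE with mixed decoder:      {mse_mixed:.3f}")
```

In this toy, the decoder that points exactly along the true feature has a higher reconstruction error than a decoder that blends in part of the correlated feature, so a reconstruction objective at too-low L0 would prefer the mixed direction. The tilt angle `0.5 * arctan(2)` is the minimizer of the error for this specific data and was derived by hand for the example.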

Peer Reviews

Decision·Submitted to ICLR 2026

Reviewer 01·Rating 2·Confidence 4

Strengths

The paper identifies an important issue (feature hedging) that can arise from poor hyperparameter (L0) selection, which is relevant for practitioners to keep in mind when training SAEs. Additionally, the demonstration with toy models that popular "sparsity versus reconstruction" analyses can lead to poor L0 selection may be helpful to the SAE research community.

Weaknesses

It has been a well-established fact since the earliest days of machine learning that poor hyperparameter selection leads to underperforming models. It is not clear to me that the paper makes any contribution beyond showing that this is also true of the L0 hyperparameter when training SAEs. One potential contribution of the work would be its proposal of a "metric" $s_n^{dec}$ that might be useful in estimating L0 values. However, it is not clear what $s_n^{dec}$ is intended to measure, nor i

Reviewer 02·Rating 6·Confidence 3

Strengths

1. Clear identification of a critical issue with sparsity tuning in SAEs.
2. Strong empirical evidence with both toy models and comprehensive testing of LLMs (Gemma-2-2b and Llama-3.2-1b). Specifically, Section 3.3 shows an incorrect, feature-mixing SAE achieving a better MSE (2.73) than the perfectly correct ground-truth SAE (MSE 4.88).
3. The proposed $s_n^{dec}$ metric (Section 3.5) is well-motivated and is shown to be a useful proxy for feature correctness.
4. Insightful analysis showing that

Weaknesses

1. Lacks theoretical grounding; findings are entirely empirical.
2. Toy models assume orthogonal and linearly separable ground-truth features, which may not represent real LLMs.
3. The metric $s_n^{dec}$ requires manual hyperparameter tuning (choice of n, batch size). The paper's own attempt at an automatic optimization algorithm (Appendix A.6) is admitted to be hard and to "require a lot of hyper-parameter tuning to work in real LLMs limiting its utility".
4. The authors explicitly state

Reviewer 03·Rating 4·Confidence 3

Strengths

The authors demonstrate a key limitation of standard parameter selection practices used by SAE practitioners: that SAEs with low L0 are often selected based on sparsity-reconstruction tradeoff analyses yet impede true feature recovery. They develop two metrics, both capturing when L0 is too low in toy models, validating the results on real data from multiple LLMs. They test their metrics on two SOTA SAE architectures. The result is a timely, well-tested toolkit for SAE evaluation that can be

Weaknesses

One thing that I really miss in this paper: You are saying that rate-distortion heuristics (like picking the lowest L0 with the lowest MSE) do not give the most disentangled SAE. Instead, you are proposing a different heuristic. If that heuristic is better, then you should be able to just take a bunch of pretrained SAEs from SAEbench and show how your heuristic leads to a better SAE across the different metrics. What is stopping you from doing that experiment? Please include references and loo

Taxonomy

Topics: Generative Adversarial Networks and Image Synthesis · Explainable Artificial Intelligence (XAI) · Domain Adaptation and Few-Shot Learning