On the Limits of Sparse Autoencoders: A Theoretical Framework and Reweighted Remedy

Jingyi Cui; Qi Zhang; Yifei Wang; Yisen Wang

arXiv:2506.15963·cs.LG·March 5, 2026

Open Access · 3 Reviews

TL;DR

This paper provides a theoretical analysis of sparse autoencoders (SAEs), showing that they generally fail to recover the ground truth features unless those features are extremely sparse, and proposes a reweighting strategy to improve interpretability and feature recovery.

Contribution

The paper offers the first theoretical analysis of SAEs with a closed-form solution, identifies their failure modes, and introduces a weighted SAE method with principled weight selection for better feature recovery.

Findings

01. SAEs generally fail to recover the ground truth features unless those features are extremely sparse.

02. The proposed weighted SAE significantly improves feature monosemanticity.

03. The theoretical analysis is validated through experiments across multiple settings.

Abstract

Sparse autoencoders (SAEs) have recently emerged as a powerful tool for interpreting the features learned by large language models (LLMs). By reconstructing features with sparsely activated networks, SAEs aim to disentangle complex superposed polysemantic features into interpretable monosemantic ones. Despite their wide application, it remains unclear under what conditions SAEs can fully recover the ground truth monosemantic features from the superposed polysemantic ones. In this paper, we provide the first theoretical analysis with a closed-form solution for SAEs, revealing that they generally fail to fully recover the ground truth monosemantic features unless the ground truth features are extremely sparse. To improve the feature recovery of SAEs in general cases, we propose a reweighting strategy aimed at enhancing the reconstruction of the ground truth monosemantic features instead…
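For readers unfamiliar with the setup, the following is a minimal sketch of a ReLU sparse autoencoder and a reconstruction loss with optional per-dimension weights. It is an illustration under stated assumptions, not the authors' method: the function names (`sae_forward`, `sae_loss`), the L1 sparsity penalty, and the way `weights` enters the loss are all hypothetical, and the paper's principled weight selection is not reproduced here.

```python
import random

def matvec(W, v):
    """Multiply matrix W (a list of rows) by vector v."""
    return [sum(w * a for w, a in zip(row, v)) for row in W]

def sae_forward(x, W_enc, b_enc, W_dec, b_dec):
    """One-layer ReLU autoencoder: returns (latent code z, reconstruction x_hat)."""
    pre = [p + b for p, b in zip(matvec(W_enc, x), b_enc)]
    z = [max(0.0, p) for p in pre]  # ReLU keeps the code non-negative
    x_hat = [p + b for p, b in zip(matvec(W_dec, z), b_dec)]
    return z, x_hat

def sae_loss(x, x_hat, z, l1_coeff=1e-3, weights=None):
    """Squared reconstruction error plus an L1 penalty on the code.

    `weights` is a hypothetical per-dimension reweighting of the
    reconstruction error (an assumption for illustration only).
    """
    if weights is None:
        weights = [1.0] * len(x)  # uniform weights give the plain loss
    recon = sum(w * (a - b) ** 2 for w, a, b in zip(weights, x, x_hat))
    sparsity = l1_coeff * sum(abs(c) for c in z)
    return recon + sparsity

random.seed(0)
d, m = 4, 16  # input dimension and (overcomplete) latent dimension
x = [random.gauss(0, 1) for _ in range(d)]
W_enc = [[random.gauss(0, 0.1) for _ in range(d)] for _ in range(m)]  # m x d
W_dec = [[random.gauss(0, 0.1) for _ in range(m)] for _ in range(d)]  # d x m
b_enc = [0.0] * m
b_dec = [0.0] * d

z, x_hat = sae_forward(x, W_enc, b_enc, W_dec, b_dec)
loss_plain = sae_loss(x, x_hat, z)
loss_weighted = sae_loss(x, x_hat, z, weights=[0.5, 1.0, 1.5, 2.0])
print(loss_plain, loss_weighted)
```

With uniform weights the weighted loss reduces to the plain one, so the reweighting only changes which input dimensions the reconstruction term emphasizes.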

Peer Reviews

Decision·ICLR 2026 Poster

Reviewer 01 · Rating 4 · Confidence 3

Strengths

- The paper provides a detailed theoretical analysis.
- The paper proposes a new reweighting strategy that may reduce polysemanticity.
- The paper tests the new strategy on language models.

Weaknesses

- The paper assumes an incorrect model of the underlying data distribution, one that many researchers in the field no longer consider valid. There is not one true set of non-overlapping features, but rather many kinds of features that overlap with one another. Additionally, many features, such as parts of speech, are dense. While much work on SAEs has assumed sparse features in a non-overlapping basis, this was more reasonable a few years ago, before we'd seen so much object-level data

Reviewer 02 · Rating 6 · Confidence 3

Strengths

- **Mathematically grounded analysis.** The derivation of a closed-form SAE solution and identification of feature shrinking/vanishing phenomena clarify long-standing empirical observations in mechanistic interpretability. While related to classical sparse coding theory, the explicit analytical form for the ReLU-based SAE and the formal characterization of these degradation modes constitute a novel operational understanding that was not previously formalized.
- **Bridging theory and pra

Weaknesses

### 1. Relationship to dictionary learning and identifiability

The theoretical results closely parallel classical **identifiability conditions in sparse dictionary learning**, where full recovery requires incoherence or extreme sparsity of the underlying basis. While this correspondence is intuitive, it is not explicitly discussed in the paper. Clarifying this relationship would enhance the theoretical positioning of the work. In particular:

- Theorems 1–3 can be viewed as a **nonlinear (R

Reviewer 03 · Rating 8 · Confidence 3

Strengths

The paper’s originality lies in providing a formal, mathematical explanation for why SAEs sometimes fail to identify interpretable features, moving beyond the purely empirical understanding that dominates current mechanistic interpretability work. The authors establish clear analytical results—closed-form solutions, necessary and sufficient conditions, and a uniqueness theorem—that rigorously connect feature sparsity with successful monosemantic recovery. This theoretical grounding fills a long-

Weaknesses

The main limitation is the scope of empirical validation. The experiments are primarily performed on small or medium-scale models (Pythia-160M, ResNet-18) and under controlled settings. While these choices are appropriate for validating the theory, it remains unclear how well the findings extend to large modern LLMs or to deeper, multi-layer SAE architectures that are increasingly used in practice. Furthermore, the superposition assumption in the theoretical model treats representations as linea

Taxonomy

Topics: Explainable Artificial Intelligence (XAI) · Topic Modeling · Generative Adversarial Networks and Image Synthesis