TL;DR
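This paper asks what it means for a sparse autoencoder (SAE) to capture a concept manifold, and when existing SAE architectures do so. Its theoretical framework and empirical analysis show two distinct modes: global capture, where a compact group of atoms spans the entire manifold, and local capture, where individual features each tile a restricted region of it. Both modes have implications for interpretability.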
Contribution
It introduces a theoretical framework for what it means for an SAE to capture a manifold, and it characterizes the regimes in which SAEs do so and where they fall short, which can guide future interpretability methods.
Findings
SAEs can capture manifolds globally or locally.
SAEs often mix global and local solutions in a fragmented regime.
Manifold structure is rarely visible at the level of individual concepts.
Abstract
Sparse autoencoders (SAEs) are widely used to extract interpretable features from neural network representations, often under the implicit assumption that concepts correspond to independent linear directions. However, a growing body of evidence suggests that many concepts are instead organized along low-dimensional manifolds encoding continuous geometric relationships. This raises three basic questions: what does it mean for an SAE to capture a manifold, when do existing SAE architectures do so, and how? We develop a theoretical framework that answers these questions and show that SAEs can capture manifolds in two fundamentally different ways: globally, by allocating a compact group of atoms whose linear span contains the entire manifold, or locally, by distributing it across features that each selectively tile a restricted region of the underlying geometry. Empirically, we find that…
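The distinction between global and local capture can be made concrete with a toy example. The sketch below is an illustration under stated assumptions, not the paper's method or experiments: the manifold is a unit circle embedded in a random 2-D plane, the dictionaries are written down by hand rather than learned, and codes are obtained by least squares instead of a trained SAE encoder. All names and sizes (`D_global`, `D_local`, `K`, and so on) are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n, K = 16, 400, 12  # ambient dim, sample count, local atom count (illustrative)

# Toy concept manifold: a unit circle inside a random 2-D plane of R^d.
basis, _ = np.linalg.qr(rng.standard_normal((d, 2)))  # orthonormal columns
theta = rng.uniform(0.0, 2.0 * np.pi, size=n)
X = np.cos(theta)[:, None] * basis[:, 0] + np.sin(theta)[:, None] * basis[:, 1]

# Global capture (assumed form): two atoms whose linear span contains the circle.
D_global = basis.T                      # shape (2, d)
Z_global = X @ D_global.T               # exact codes, since the atoms are orthonormal
err_global = np.linalg.norm(X - Z_global @ D_global, axis=1).max()
usage_global = (np.abs(Z_global) > 1e-9).mean(axis=0)  # fraction of points per atom

# Local capture (assumed form): K atoms placed on the circle; each point is
# reconstructed from only the two atoms that bound its arc.
arc = 2.0 * np.pi / K
angles = arc * np.arange(K)
D_local = np.cos(angles)[:, None] * basis[:, 0] + np.sin(angles)[:, None] * basis[:, 1]
lo = np.floor(theta / arc).astype(int) % K
hi = (lo + 1) % K
Z_local = np.zeros((n, K))
for i in range(n):
    pair = D_local[[lo[i], hi[i]]]                        # shape (2, d)
    coef, *_ = np.linalg.lstsq(pair.T, X[i], rcond=None)  # solve pair.T @ coef = x
    Z_local[i, [lo[i], hi[i]]] = coef
err_local = np.linalg.norm(X - Z_local @ D_local, axis=1).max()
usage_local = (np.abs(Z_local) > 1e-9).mean(axis=0)

print("global: max error", err_global, "| per-atom usage", usage_global)
print("local:  max error", err_local, "| mean per-atom usage", usage_local.mean())
```

In this toy setting both dictionaries reconstruct the circle, but they differ in how atoms are used: the two global atoms fire on essentially every point, while each local atom fires only on the arcs adjacent to it. Inspecting a single local atom in isolation shows only a small patch of the circle.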
