Feature Starvation as Geometric Instability in Sparse Autoencoders
Faris Chaudhry, Keisuke Yano, Anthea Monod

TL;DR
This paper identifies feature starvation in sparse autoencoders as a fundamental geometric instability and proposes a new adaptive elastic net architecture to address it, improving interpretability and stability.
Contribution
The paper introduces AEN-SAEs, a fully differentiable architecture that uses adaptive elastic net regularization to mitigate feature starvation in sparse autoencoders and is grounded in classical sparse regression theory.
Findings
AEN-SAEs recover feature support under mild conditions.
AEN-SAEs mitigate feature starvation without heuristics such as resampling or nondifferentiable hard masking.
Empirical results on LLMs show improved feature stability.
Abstract
Sparse autoencoders (SAEs) are used to disentangle the dense, polysemantic internal representations of large language models (LLMs) into interpretable, monosemantic concepts. However, standard ℓ1-regularized SAEs suffer from feature starvation (dead neurons) and shrinkage bias, often requiring computationally expensive heuristic resampling and nondifferentiable hard-masking methods to bypass these challenges. We argue that feature starvation is not merely an empirical artifact of poor data diversity, but a fundamental optimization-geometric pathology of overcomplete dictionaries: the ℓ1-induced sparse coding map is unstable and fundamentally misaligned with shallow, amortized encoders. To address this structural instability, we introduce adaptive elastic net SAEs (AEN-SAEs), a fully differentiable architecture grounded in classical sparse regression. AEN-SAEs combine an…
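The abstract is cut off before it states the AEN-SAE objective, so the following is only a rough sketch of what an adaptive elastic net penalty looks like in classical sparse regression (a weighted ℓ1 term plus a squared ℓ2 term), attached to a generic one-layer ReLU SAE. The function names, the per-feature `weights` vector, the default values of `lam1` and `lam2`, and the encoder/decoder layout are all assumptions made for illustration; they are not taken from the paper.

```python
import numpy as np


def adaptive_elastic_net_penalty(z, weights, lam1, lam2):
    """Classical adaptive elastic net penalty, averaged over a batch.

    z       : (batch, n_features) latent codes
    weights : (n_features,) nonnegative per-feature L1 weights (hypothetical;
              how the paper chooses them is not stated in the excerpt)
    lam1    : strength of the weighted L1 term
    lam2    : strength of the squared L2 (ridge) term
    """
    l1 = np.sum(weights * np.abs(z), axis=-1)  # weighted L1 per sample
    l2 = np.sum(z ** 2, axis=-1)               # squared L2 per sample
    return float(np.mean(lam1 * l1 + lam2 * l2))


def sae_loss(x, W_enc, b_enc, W_dec, b_dec, weights, lam1=1e-3, lam2=1e-4):
    """Reconstruction error plus the penalty above for a one-layer ReLU SAE.

    The architecture here is a generic SAE, assumed for illustration only.
    """
    z = np.maximum(x @ W_enc + b_enc, 0.0)     # amortized (single-pass) encoder
    x_hat = z @ W_dec + b_dec                  # linear decoder (dictionary)
    recon = float(np.mean(np.sum((x - x_hat) ** 2, axis=-1)))
    return recon + adaptive_elastic_net_penalty(z, weights, lam1, lam2), z
```

Setting `lam2 = 0` and all `weights` to 1 reduces the penalty to the plain ℓ1 term of a standard SAE, which makes the two additional ingredients easy to compare in isolation.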
