Stable and Steerable Sparse Autoencoders with Weight Regularization
Piotr Jedryszek, Oliver M. Crook

TL;DR
This paper investigates how weight regularization improves the stability and steerability of sparse autoencoders across different training runs, enhancing feature consistency and interpretability.
Contribution
The paper demonstrates that L2 weight regularization, combined with tied initialization and unit-norm decoder constraints, substantially improves cross-seed feature stability and steering success in sparse autoencoders.
Findings
L2 regularization increases feature alignment across seeds.
A small L2 weight penalty roughly doubles steering success rates for TopK SAEs trained on Pythia-70M-deduped activations.
Activation steering becomes more predictable with regularization.
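As a rough illustration of what "feature alignment across seeds" can mean, the sketch below matches each decoder direction from one training run to its most similar direction in another run by cosine similarity and reports the fraction above a cutoff. The function name `shared_feature_fraction`, the 0.7 cutoff, and the toy data are assumptions made for illustration; the paper's actual matching procedure is not described in this summary.

```python
import numpy as np

def shared_feature_fraction(dec_a, dec_b, threshold=0.7):
    """Fraction of features in dec_a with a close cosine match in dec_b.

    dec_a, dec_b: arrays of shape (n_features, d_model); each row is one
    decoder direction. `threshold` is an illustrative cutoff, not a value
    taken from the paper.
    """
    # Normalize rows so the dot product equals cosine similarity.
    a = dec_a / np.linalg.norm(dec_a, axis=1, keepdims=True)
    b = dec_b / np.linalg.norm(dec_b, axis=1, keepdims=True)
    cos = a @ b.T                # (n_features_a, n_features_b)
    best = cos.max(axis=1)       # best match in run B for each feature in run A
    return float((best >= threshold).mean())

rng = np.random.default_rng(0)
base = rng.normal(size=(8, 16))

# Toy "second seed": the same directions, permuted and rescaled.
# Cosine matching ignores both feature order and scale.
other = 2.0 * base[rng.permutation(8)]
frac_same = shared_feature_fraction(base, other)

# Unrelated random directions, with a very strict cutoff.
frac_rand = shared_feature_fraction(base, rng.normal(size=(8, 16)), threshold=0.99)
```

Matching on normalized decoder rows makes the score insensitive to feature ordering and scale, which differ arbitrarily between seeds.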
Abstract
Sparse autoencoders (SAEs) are widely used to extract human-interpretable features from neural network activations, but their learned features can vary substantially across random seeds and training choices. To improve stability, we study weight regularization by adding L1 or L2 penalties on encoder and decoder weights, and evaluate how regularization interacts with common SAE training defaults. On MNIST, we observe that L2 weight regularization produces a core of highly aligned features and, when combined with tied initialization and unit-norm decoder constraints, dramatically increases cross-seed feature consistency. For TopK SAEs trained on language model activations (Pythia-70M-deduped), adding a small L2 weight penalty increases the fraction of features shared across three random seeds and roughly doubles steering success rates, while leaving the mean of automated…
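The ingredients named in the abstract (TopK sparsity, an L2 penalty on encoder and decoder weights, tied initialization, unit-norm decoder rows) can be sketched in a few lines of NumPy. This is a minimal sketch under stated assumptions, not the authors' implementation: biases are omitted, the TopK output is clamped at zero, and the names `init_tied`, `topk`, `sae_loss`, and `l2_coeff` as well as all sizes are hypothetical.

```python
import numpy as np

def init_tied(d_model, n_features, rng):
    """Tied initialization: the encoder starts as the decoder's transpose."""
    W_dec = rng.normal(size=(n_features, d_model))
    W_dec /= np.linalg.norm(W_dec, axis=1, keepdims=True)  # unit-norm decoder rows
    W_enc = W_dec.T.copy()
    return W_enc, W_dec

def topk(pre, k):
    """Keep the k largest pre-activations per row and zero the rest."""
    idx = np.argpartition(pre, -k, axis=1)[:, -k:]
    out = np.zeros_like(pre)
    np.put_along_axis(out, idx, np.take_along_axis(pre, idx, axis=1), axis=1)
    return np.maximum(out, 0.0)  # assumption: non-negative feature activations

def sae_loss(x, W_enc, W_dec, k, l2_coeff):
    """Reconstruction error plus an L2 penalty on both weight matrices."""
    z = topk(x @ W_enc, k)        # sparse feature activations
    x_hat = z @ W_dec             # reconstruction
    recon = np.mean(np.sum((x - x_hat) ** 2, axis=1))
    penalty = l2_coeff * (np.sum(W_enc ** 2) + np.sum(W_dec ** 2))
    return recon + penalty, z

rng = np.random.default_rng(0)
W_enc, W_dec = init_tied(d_model=16, n_features=32, rng=rng)
x = rng.normal(size=(4, 16))      # stand-in for model activations

loss_plain, z = sae_loss(x, W_enc, W_dec, k=4, l2_coeff=0.0)
loss_reg, _ = sae_loss(x, W_enc, W_dec, k=4, l2_coeff=1e-3)
```

One consequence worth noting: if decoder rows are held at exactly unit norm throughout training, the decoder term of the L2 penalty is a constant (one per feature), so in that setting the penalty effectively constrains only the encoder.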
Taxonomy
Topics: Explainable Artificial Intelligence (XAI) · Multimodal Machine Learning Applications · Generative Adversarial Networks and Image Synthesis
