Aligned Training: A Parameter-Free Method to Improve Feature Quality and Stability of Sparse Autoencoders (SAE)
Michał Brzozowski, Neo Christopher Chung

TL;DR
Aligned training is a parameter-free reparameterization method for sparse autoencoders that enhances feature quality, stability, and reconstruction without additional hyperparameters or computational cost.
Contribution
The paper introduces aligned training, a novel geometric constraint for SAEs that improves feature activation, stability, and interpretability across various architectures and settings.
Findings
Significant improvements in feature activation and stability across models.
Enhanced reconstruction quality and interpretability.
Compatibility with existing interpretability techniques.
Abstract
Sparse autoencoders (SAEs) are one of the main methods to interpret the inner workings of deep neural networks (DNNs), decomposing activations into higher-dimensional features. However, they exhibit critical shortcomings: a large fraction of features are never activated, and the learned features are unstable. Variants of SAEs attempt to mitigate these issues, but they require additional data, resampling, or training. We propose aligned training, a parameter-free reparameterization of SAEs that simultaneously improves reconstruction quality, eliminates dead features, and significantly enhances stability across training seeds. Our approach is motivated by an overlooked observation that SAE feature quality, measured by the inner product between encoder and decoder directions (which we call the alignment score), follows a bimodal distribution across all modern architectures.…
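The abstract defines the alignment score as the inner product between a feature's encoder and decoder directions. The sketch below illustrates that quantity, and the notion of a dead feature, on toy weights. It is not the authors' code. The function names, the weight layout (one row per feature), and the unit-normalization of both directions before the inner product are assumptions made here for illustration, since the excerpt does not say whether the directions are normalized.

```python
import numpy as np

def alignment_scores(W_enc, W_dec, eps=1e-12):
    """Per-feature inner product between encoder and decoder directions.

    Assumed layout (hypothetical): W_enc and W_dec both have shape
    (n_features, d_model), one row per feature. Rows are unit-normalized
    first, so each score is a cosine in [-1, 1]; this normalization is an
    assumption, not something stated in the abstract.
    """
    enc = W_enc / (np.linalg.norm(W_enc, axis=1, keepdims=True) + eps)
    dec = W_dec / (np.linalg.norm(W_dec, axis=1, keepdims=True) + eps)
    return np.sum(enc * dec, axis=1)

def dead_feature_fraction(acts):
    """Fraction of features that never activate on a batch.

    acts: (n_samples, n_features) array of non-negative feature activations.
    """
    ever_active = (acts > 0).any(axis=0)
    return float(np.mean(~ever_active))

# Toy weights: the first half of the features has matching encoder and
# decoder directions, the second half has unrelated random encoder rows.
rng = np.random.default_rng(0)
d_model, n_features = 16, 64
W_dec = rng.normal(size=(n_features, d_model))
W_enc = W_dec.copy()
W_enc[n_features // 2:] = rng.normal(size=(n_features // 2, d_model))

scores = alignment_scores(W_enc, W_dec)
```

On a trained SAE, a histogram of `scores` is one way to look for the bimodal distribution the abstract describes. The toy weights above only produce two groups because they were constructed that way.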
