Efficient Dictionary Learning with Switch Sparse Autoencoders
Anish Mudide, Joshua Engels, Eric J. Michaud, Max Tegmark, Christian Schroeder de Witt

TL;DR
This paper introduces Switch Sparse Autoencoders, a new architecture that improves the efficiency of training sparse autoencoders by routing activations through smaller expert networks, enabling scalable and interpretable feature decomposition.
Contribution
The paper proposes Switch Sparse Autoencoders, inspired by mixture of experts, to reduce computational costs and improve scalability of SAEs for feature decomposition.
Findings
At a fixed training compute budget, Switch SAEs achieve a better reconstruction-sparsity trade-off than the other SAE architectures tested.
Switch SAEs enable scaling to more features with a fixed compute budget.
Features in Switch SAEs remain as interpretable as those in traditional SAEs.
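The compute argument behind these findings can be illustrated with a rough, back-of-the-envelope FLOP count for the encoder step. This is a sketch under stated assumptions rather than the paper's own accounting: it assumes one expert is active per input, that features are split evenly across experts, and that a matrix-vector product costs two FLOPs per weight. The function names and the example sizes are hypothetical.

```python
def dense_encoder_flops(d_model: int, n_features: int) -> int:
    """Encoder cost of a single wide SAE: one multiply and one add per weight."""
    return 2 * d_model * n_features


def switch_encoder_flops(d_model: int, n_features: int, n_experts: int) -> int:
    """Encoder cost when a router picks one expert (assumed even feature split)."""
    router = 2 * d_model * n_experts                  # score every expert
    expert = 2 * d_model * (n_features // n_experts)  # run only the chosen expert
    return router + expert


# Hypothetical sizes, chosen only for illustration.
d_model, n_features = 768, 32768
dense = dense_encoder_flops(d_model, n_features)
for n_experts in (1, 8, 32):
    switch = switch_encoder_flops(d_model, n_features, n_experts)
    print(f"experts={n_experts:>2}  switch/dense encoder FLOPs = {switch / dense:.4f}")
```

Under these assumptions the encoder cost per input falls roughly in proportion to the number of experts, which is why the total feature count can grow while the compute budget stays fixed.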
Abstract
Sparse autoencoders (SAEs) are a recent technique for decomposing neural network activations into human-interpretable features. However, in order for SAEs to identify all features represented in frontier models, it will be necessary to scale them up to very high width, posing a computational challenge. In this work, we introduce Switch Sparse Autoencoders, a novel SAE architecture aimed at reducing the compute cost of training SAEs. Inspired by sparse mixture of experts models, Switch SAEs route activation vectors between smaller "expert" SAEs, enabling SAEs to efficiently scale to many more features. We present experiments comparing Switch SAEs with other SAE architectures, and find that Switch SAEs deliver a substantial Pareto improvement in the reconstruction vs. sparsity frontier for a given fixed training compute budget. We also study the geometry of features across experts,…
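The routing idea in the abstract can be sketched in a few lines of plain Python. The abstract does not give the exact formulation, so everything below is an assumption made for illustration: top-1 routing via a softmax over router scores, TopK experts (one review mentions top-k SAEs), an output scaled by the router probability, and no bias terms. All names (`switch_sae_forward`, `router_W`, `experts`) are hypothetical.

```python
import math
import random


def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def softmax(z):
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]


def top_k(acts, k):
    """Keep the k largest activations and zero out the rest."""
    keep = sorted(range(len(acts)), key=lambda i: acts[i], reverse=True)[:k]
    out = [0.0] * len(acts)
    for i in keep:
        out[i] = acts[i]
    return out


def switch_sae_forward(x, router_W, experts, k):
    """Route x to a single expert SAE and reconstruct it (assumed formulation)."""
    probs = softmax(matvec(router_W, x))
    chosen = max(range(len(probs)), key=lambda i: probs[i])  # top-1 routing
    W_enc, W_dec = experts[chosen]
    latents = top_k(matvec(W_enc, x), k)                     # sparse code from one expert
    recon = [probs[chosen] * v for v in matvec(W_dec, latents)]
    return chosen, latents, recon


def rand_matrix(rows, cols):
    return [[random.gauss(0, 1) for _ in range(cols)] for _ in range(rows)]


random.seed(0)
d_model, n_experts, latents_per_expert, k = 4, 3, 8, 2
router_W = rand_matrix(n_experts, d_model)
experts = [
    (rand_matrix(latents_per_expert, d_model), rand_matrix(d_model, latents_per_expert))
    for _ in range(n_experts)
]
x = [random.gauss(0, 1) for _ in range(d_model)]
expert_id, latents, recon = switch_sae_forward(x, router_W, experts, k)
print("expert:", expert_id, "active latents:", sum(v != 0.0 for v in latents))
```

Only the chosen expert's encoder and decoder are evaluated for each input, which is where the compute saving comes from in this sketch. The weights here are random and untrained, so the reconstruction itself is meaningless.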
Peer Reviews
Decision: ICLR 2025 Poster
This work applies a mixture-of-experts layer to sparse autoencoder training and proposes Switch Sparse Autoencoders. However, the architecture itself and the training technique are straightforward applications of existing work. The authors conduct experiments comparing the Pareto frontiers of the proposed Switch SAEs against a single SAE baseline, showing better reconstruction performance at similar sparsity levels. I think it is somewhat expected for the mixture model due t
See above.
The authors explore a combination of methods that makes a lot of sense (switch layers and top-k SAEs) and analyze the resulting method in a rigorous and convincing way. They make the interesting findings that Switch SAEs are actually cheaper to train in terms of FLOPs (for a given target reconstruction loss) and at the same time require more features in total (which will be an additional challenge for the usual subsequent automatic interpretation of the learnt features). Also they show that the
The idea may be considered somewhat incremental, but given the very good execution of the paper I don't think that's a big problem. Switch SAEs require more features for the same reconstruction loss, which down the road presents significant drawbacks for automatic interpretability and for interventions based on SAE features. The cost of automatic annotation scales linearly with the number of features to annotate and in my opinion having a giant number of SAE features also makes them l
The strengths of the paper are its simple approach and side studies.
- The approach enables a trade-off between model parameters and computational requirements which should be useful to practitioners.
- The auxiliary experiments are well-motivated and support the authors’ hypotheses. Specifically, the discussions of redundant representations accompanied by t-SNE visualizations are illuminating.
- The limitations (parameter efficiency and feature duplication) are discussed in an informative way.
The main weakness of the paper is its quantitative evaluation.
- The hyperparameter $\alpha$ is set to 3 (line 258) without further discussion. How was this value chosen? Is this a reasonable default in most settings or is tuning required based on the model and dataset?
- The abstract claims a “substantial Pareto improvement” for a fixed training budget. This corresponds to the FLOP-matched experiments shown in the bottom left plot of Figure 3. It is not directly obvious that this constitutes a
Taxonomy
Topics: Speech and Audio Processing · Face and Expression Recognition · Advanced Data Compression Techniques
