SplInterp: Improving our Understanding and Training of Sparse Autoencoders
Jeremy Budd, Javier Ideami, Benjamin Macdowall Rynne, Keith Duggar, Randall Balestriero

TL;DR
SplInterp places sparse autoencoders (SAEs) within the spline theory of deep learning, characterizes their geometry, and introduces the PAM-SGD training algorithm, reporting improved sample efficiency and sparsity, with promising results on MNIST and LLMs.
Contribution
This work provides a theoretical framework for SAEs using spline theory, characterizes their geometry, and introduces PAM-SGD for better training and interpretability.
Findings
SAEs generalize k-means autoencoders to be piecewise affine.
PAM-SGD improves sample efficiency and sparsity in training.
Empirical results show promising performance on MNIST and LLMs.
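To make the first finding concrete, here is a minimal sketch of a "k-means autoencoder", assuming the usual reading of the term: the encoder picks the nearest centroid as a one-hot code and the decoder returns that centroid. The function name `kmeans_autoencode` and the NumPy setup are illustrative assumptions, not taken from the paper. The reconstruction is piecewise constant, which is the case that a piecewise affine SAE would generalize.

```python
import numpy as np

def kmeans_autoencode(x, centroids):
    """Hypothetical helper: encode x as a one-hot code, decode to a centroid.

    centroids has shape (num_centroids, dim); x has shape (dim,).
    """
    # argmin_j ||x - c_j||^2 equals argmax_j (c_j . x - 0.5 * ||c_j||^2),
    # so the nearest-centroid rule is a "top-1" over affine scores of x.
    scores = centroids @ x - 0.5 * np.sum(centroids ** 2, axis=1)
    j = int(np.argmax(scores))
    code = np.zeros(len(centroids))
    code[j] = 1.0
    # The decoder is linear in the code; the output is simply centroid j,
    # so the reconstruction is constant on each nearest-centroid region.
    reconstruction = centroids.T @ code
    return code, reconstruction

centroids = np.array([[0.0, 0.0], [4.0, 0.0], [0.0, 4.0]])
code, x_hat = kmeans_autoencode(np.array([3.0, 1.0]), centroids)
```

Writing the nearest-centroid rule as a top-1 selection over affine scores is what makes the comparison with TopK-style encoders natural in this sketch.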
Abstract
Sparse autoencoders (SAEs) have received considerable recent attention as tools for mechanistic interpretability, showing success at extracting interpretable features even from very large LLMs. However, this research has been largely empirical, and there have been recent doubts about the true utility of SAEs. In this work, we seek to enhance the theoretical understanding of SAEs, using the spline theory of deep learning. By situating SAEs in this framework: we discover that SAEs generalise "k-means autoencoders" to be piecewise affine, but sacrifice accuracy for interpretability vs. the optimal "k-means-esque plus local principal component analysis (PCA)" piecewise affine autoencoder. We characterise the underlying geometry of (TopK) SAEs using power diagrams. And we develop a novel proximal alternating method SGD (PAM-SGD) algorithm for training SAEs, with both solid…
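The piecewise affine claim in the abstract can be illustrated with a small sketch. It assumes a simplified TopK SAE that keeps the k largest encoder pre-activations, zeroes the rest, and decodes linearly; it applies no ReLU, and the names `topk_sae` and `local_affine_map` are hypothetical. The paper's exact architecture may differ, and the sketch does not construct the power diagram itself.

```python
import numpy as np

def topk_sae(x, W_enc, b_enc, W_dec, b_dec, k):
    """Simplified TopK SAE forward pass (assumption: no ReLU after TopK)."""
    pre = W_enc @ x + b_enc                       # encoder pre-activations
    active = np.sort(np.argpartition(pre, -k)[-k:])  # indices of k largest
    z = np.zeros_like(pre)
    z[active] = pre[active]                       # sparse code
    return W_dec @ z + b_dec, active

def local_affine_map(active, W_enc, b_enc, W_dec, b_dec):
    """Affine map x -> A x + c that the SAE applies while `active` is fixed."""
    A = W_dec[:, active] @ W_enc[active]
    c = W_dec[:, active] @ b_enc[active] + b_dec
    return A, c

rng = np.random.default_rng(0)
dim, width, k = 5, 8, 3                           # illustrative sizes
W_enc, b_enc = rng.normal(size=(width, dim)), rng.normal(size=width)
W_dec, b_dec = rng.normal(size=(dim, width)), rng.normal(size=dim)

x = rng.normal(size=dim)
x_hat, active = topk_sae(x, W_enc, b_enc, W_dec, b_dec, k)
A, c = local_affine_map(active, W_enc, b_enc, W_dec, b_dec)
```

Under these assumptions, every input that selects the same active set is reconstructed by the same pair `(A, c)`, so the input space splits into regions with one affine map each.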
Taxonomy
Topics: Explainable Artificial Intelligence (XAI) · Generative Adversarial Networks and Image Synthesis · Model Reduction and Neural Networks
