Data Whitening Improves Sparse Autoencoder Learning
Ashwin Saraswatula, David Klindt

TL;DR
Applying PCA whitening to input activations improves both the interpretability and the optimization efficiency of sparse autoencoders (SAEs) across architectures and metrics. The authors advocate adopting it as a standard step in SAE training.
Contribution
This work shows that PCA whitening improves SAE performance and interpretability by reshaping the optimization landscape, a claim supported by theoretical analysis and extensive empirical evaluation.
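For readers unfamiliar with the preprocessing step, PCA whitening can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the `eps` regularizer, and the choice to return an explicit inverse are all hypothetical.

```python
import numpy as np


def fit_pca_whitening(X, eps=1e-5):
    """Fit a PCA whitening transform on activations X of shape (n_samples, d).

    Returns the mean, the whitening matrix W, and its inverse W_inv.
    `eps` is an assumed regularizer that keeps near-zero-variance
    directions from being amplified without bound.
    """
    mean = X.mean(axis=0)
    Xc = X - mean
    cov = Xc.T @ Xc / (X.shape[0] - 1)        # sample covariance, shape (d, d)
    eigvals, eigvecs = np.linalg.eigh(cov)     # cov = V diag(eigvals) V^T
    scale = np.sqrt(np.clip(eigvals, 0.0, None) + eps)
    W = eigvecs / scale                        # V diag(1/scale): rotate, then rescale
    W_inv = (eigvecs * scale).T                # diag(scale) V^T, so W @ W_inv = I
    return mean, W, W_inv


def whiten(X, mean, W):
    """Map activations to decorrelated, (approximately) unit-variance coordinates."""
    return (X - mean) @ W


def unwhiten(Z, mean, W_inv):
    """Map whitened vectors back to the original activation space."""
    return Z @ W_inv + mean
```

In a pipeline of this kind, one would presumably train the SAE on `whiten(X, mean, W)` and pass reconstructions through `unwhiten` before comparing them with the original activations. That wiring is an assumption here, since this summary does not describe it.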
Findings
Whitening improves interpretability metrics like sparse probing accuracy.
Whitening makes the optimization landscape more convex and easier to optimize.
Whitening causes minor drops in reconstruction quality, but the interpretability gains outweigh them.
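One standard way to see why decorrelating inputs can ease optimization is the linear least-squares case, where the Hessian of the loss equals the input covariance. The derivation below is a simplified analogy, assuming centered inputs and a full-rank covariance; it is not the paper's own analysis, and the SAE objective itself is not a quadratic.

```latex
% Linear least-squares illustration (an analogy, not the SAE objective)
L(w) = \frac{1}{2n}\lVert Xw - y\rVert_2^2,
\qquad
\nabla^2 L(w) = \frac{1}{n} X^\top X = \Sigma
\quad (\text{centered } X)

% Gradient descent slows down as the condition number grows
\kappa(\Sigma) = \frac{\lambda_{\max}(\Sigma)}{\lambda_{\min}(\Sigma)}

% PCA whitening with \Sigma = V \Lambda V^\top
z = \Lambda^{-1/2} V^\top x
\;\Longrightarrow\;
\Sigma_z = \Lambda^{-1/2} V^\top \Sigma\, V \Lambda^{-1/2} = I,
\qquad
\kappa(\Sigma_z) = 1
```

In this toy setting, whitening turns elongated loss contours into spherical ones, so a single step size suits every direction.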
Abstract
Sparse autoencoders (SAEs) have emerged as a promising approach for learning interpretable features from neural network activations. However, the optimization landscape for SAE training can be challenging due to correlations in the input data. We demonstrate that applying PCA whitening to input activations -- a standard preprocessing technique in classical sparse coding -- improves SAE performance across multiple metrics. Through theoretical analysis and simulation, we show that whitening transforms the optimization landscape, making it more convex and easier to navigate. We evaluate both ReLU and Top-K SAEs across diverse model architectures, widths, and sparsity regimes. Empirical evaluation on SAEBench, a comprehensive benchmark for sparse autoencoders, reveals that whitening consistently improves interpretability metrics, including sparse probing accuracy and feature…
Taxonomy
Topics: Explainable Artificial Intelligence (XAI) · Generative Adversarial Networks and Image Synthesis · Adversarial Robustness in Machine Learning
