Sanity Checks for Sparse Autoencoders: Do SAEs Beat Random Baselines?
Anton Korznikov, Andrey Galichin, Alexey Dontsov, Oleg Rogov, Ivan Oseledets, Elena Tutubalina

TL;DR
This paper critically evaluates Sparse Autoencoders (SAEs) and finds that they often fail to recover true features and do not outperform simple random baselines in interpretability and causal editing, which calls their effectiveness into question.
Contribution
The study provides the first comprehensive evaluation showing that SAEs do not reliably decompose neural network features and perform comparably to random baselines in key interpretability tasks.
Findings
SAEs recover only 9% of true features despite high explained variance
Random baselines match SAEs in interpretability, sparse probing, and causal editing
SAEs do not reliably decompose models' internal mechanisms
Abstract
Sparse Autoencoders (SAEs) have emerged as a promising tool for interpreting neural networks by decomposing their activations into sparse sets of human-interpretable features. Recent work has introduced multiple SAE variants and successfully scaled them to frontier models. Despite much excitement, a growing number of negative results in downstream tasks casts doubt on whether SAEs recover meaningful features. To directly investigate this, we perform two complementary evaluations. On a synthetic setup with known ground-truth features, we demonstrate that SAEs recover only 9% of true features despite achieving high explained variance, showing that they fail at their core task even when reconstruction is strong. To evaluate SAEs on real activations, we introduce three baselines that constrain SAE feature directions or their activation patterns to random values. Through extensive…
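The synthetic evaluation described in the abstract compares learned feature directions against known ground-truth ones. The following is a minimal sketch of one way such a recovery rate could be measured, assuming a ground-truth feature counts as recovered when some learned direction has cosine similarity above a threshold with it. The function name `recovered_fraction`, the 0.9 threshold, and the dimensions are illustrative assumptions and are not taken from the paper.

```python
import numpy as np


def recovered_fraction(true_dirs, learned_dirs, threshold=0.9):
    """Fraction of ground-truth directions matched by at least one learned direction.

    true_dirs:    array of shape (n_true, d), one ground-truth feature per row.
    learned_dirs: array of shape (n_learned, d), e.g. rows of an SAE decoder.
    threshold:    hypothetical cosine-similarity cutoff for counting a match.
    """
    # Normalize rows to unit length so dot products are cosine similarities.
    t = true_dirs / np.linalg.norm(true_dirs, axis=1, keepdims=True)
    l = learned_dirs / np.linalg.norm(learned_dirs, axis=1, keepdims=True)
    cos = t @ l.T                # shape (n_true, n_learned)
    best = cos.max(axis=1)       # best-matching learned direction per true feature
    return float((best >= threshold).mean())


rng = np.random.default_rng(0)
true_dirs = rng.standard_normal((50, 64))     # synthetic ground-truth features
random_dirs = rng.standard_normal((200, 64))  # a dictionary of random directions

perfect_score = recovered_fraction(true_dirs, true_dirs)
random_score = recovered_fraction(true_dirs, random_dirs)
print(perfect_score, random_score)
```

Scoring a dictionary of random directions alongside a trained one gives a reference point in the spirit of the paper's sanity checks: a recovery rate only means something if it clearly exceeds what random directions achieve under the same criterion.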
Taxonomy
Topics: Explainable Artificial Intelligence (XAI) · Generative Adversarial Networks and Image Synthesis · Adversarial Robustness in Machine Learning
