Sanity Checks for Sparse Autoencoders: Do SAEs Beat Random Baselines?

Anton Korznikov; Andrey Galichin; Alexey Dontsov; Oleg Rogov; Ivan Oseledets; Elena Tutubalina

arXiv:2602.14111·cs.LG·February 17, 2026

TL;DR

This paper critically evaluates Sparse Autoencoders (SAEs) and finds that they often fail to recover true features and do not outperform simple random baselines in interpretability, sparse probing, or causal editing, which calls their effectiveness into question.

Contribution

The study provides a comprehensive evaluation showing that SAEs do not reliably decompose neural network activations into meaningful features and perform comparably to random baselines on key interpretability tasks.

Findings

01 On a synthetic setup with known ground-truth features, SAEs recover only 9% of true features despite achieving 71% explained variance

02 Random baselines match SAEs in interpretability, sparse probing, and causal editing

03 SAEs do not reliably decompose models' internal mechanisms
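The gap between feature recovery and explained variance in the first finding is easier to see when the two metrics are written out. The following is a minimal sketch, not the paper's evaluation code: the function names, the use of maximum cosine similarity as the matching rule, and the 0.9 threshold are all assumptions made for illustration.

```python
import numpy as np

def explained_variance(x, x_hat):
    """Fraction of the variance of x captured by the reconstruction x_hat."""
    residual = np.sum((x - x_hat) ** 2)
    total = np.sum((x - x.mean(axis=0)) ** 2)
    return 1.0 - residual / total

def recovered_fraction(true_dirs, learned_dirs, threshold=0.9):
    """Share of ground-truth directions matched by at least one learned direction.

    A true direction counts as recovered when its best cosine similarity
    with any learned direction reaches `threshold` (an assumed cutoff).
    """
    t = true_dirs / np.linalg.norm(true_dirs, axis=1, keepdims=True)
    l = learned_dirs / np.linalg.norm(learned_dirs, axis=1, keepdims=True)
    best = (t @ l.T).max(axis=1)  # best match per true direction
    return float(np.mean(best >= threshold))

# Toy case: four true directions, two learned ones.
true_dirs = np.eye(4)
learned_dirs = np.array([
    [1.0, 0.0, 0.0, 0.0],  # matches the first true direction exactly
    [0.0, 1.0, 1.0, 0.0],  # blends the second and third, matches neither
])
fraction = recovered_fraction(true_dirs, learned_dirs)
```

In this toy case only one of the four true directions is matched, even though the blended direction would still help reconstruct inputs that use the second and third features. That is the kind of mismatch between reconstruction quality and feature recovery the finding describes.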

Abstract

Sparse Autoencoders (SAEs) have emerged as a promising tool for interpreting neural networks by decomposing their activations into sparse sets of human-interpretable features. Recent work has introduced multiple SAE variants and successfully scaled them to frontier models. Despite much excitement, a growing number of negative results in downstream tasks casts doubt on whether SAEs recover meaningful features. To directly investigate this, we perform two complementary evaluations. On a synthetic setup with known ground-truth features, we demonstrate that SAEs recover only $9\%$ of true features despite achieving $71\%$ explained variance, showing that they fail at their core task even when reconstruction is strong. To evaluate SAEs on real activations, we introduce three baselines that constrain SAE feature directions or their activation patterns to random values. Through extensive…
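The abstract as shown here does not specify how the three baselines are built. As a rough illustration of what "constraining feature directions to random values" could mean, the sketch below draws fixed random unit directions and encodes activations with a simple top-k rule. Everything in it is an assumption: the function names, the Gaussian sampling, and the top-k encoder are hypothetical and should not be read as the authors' method.

```python
import numpy as np

def random_unit_directions(n_features, d_model, seed=0):
    """Draw n_features random directions in R^d_model and normalize each to unit length."""
    rng = np.random.default_rng(seed)
    w = rng.standard_normal((n_features, d_model))
    return w / np.linalg.norm(w, axis=1, keepdims=True)

def topk_codes(x, directions, k):
    """Project x onto each direction, clip at zero, keep the k largest per sample."""
    scores = np.maximum(x @ directions.T, 0.0)
    # Column indices of everything except the k largest scores in each row.
    drop = np.argsort(scores, axis=1)[:, :-k]
    codes = scores.copy()
    np.put_along_axis(codes, drop, 0.0, axis=1)
    return codes

# Hypothetical sizes: 16 random features over 8-dimensional activations.
directions = random_unit_directions(n_features=16, d_model=8)
x = np.random.default_rng(1).standard_normal((5, 8))
codes = topk_codes(x, directions, k=3)
x_hat = codes @ directions  # reconstruction from the random dictionary
```

A dictionary like this carries no learned structure, so any interpretability or probing score it reaches gives a floor that a trained SAE would need to beat.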

Taxonomy

Topics: Explainable Artificial Intelligence (XAI) · Generative Adversarial Networks and Image Synthesis · Adversarial Robustness in Machine Learning