Are Sparse Autoencoder Benchmarks Reliable?

David Chanin

arXiv:2605.18229 · cs.LG · May 19, 2026

TL;DR

This paper tests how reliable current SAE benchmarks are. It finds that TPP and SCR fail at their canonical settings and that the remaining SAEBench metrics are noisier and less discriminating than the field assumes, which motivates better evaluation standards.

Contribution

It audits existing SAE quality metrics in SAEBench, identifies their shortcomings, and highlights the necessity for better benchmarks in the field.

Findings

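01. TPP and SCR fail multiple lenses at their canonical settings and should not be used to evaluate SAEs.

02. The other SAEBench metrics show higher reseed noise and lower discriminability than the field assumes.

03. sae-probes is the most reliable metric tested, but it still struggles to separate variants of the same SAE.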

Abstract

Sparse autoencoders (SAEs) are a core interpretability tool for large language models, and progress on SAE architectures depends on benchmarks that reliably distinguish better SAEs from worse ones. We audit the SAE quality metrics in SAEBench, the de-facto standard SAE evaluation suite, through three complementary lenses: reseed noise on a fixed SAE, ground-truth correlation on synthetic SAEs, and discriminability across training trajectories. We find that two of these metrics, Targeted Probe Perturbation (TPP) and Spurious Correlation Removal (SCR), fail multiple lenses at their canonical settings and should not be used to evaluate SAEs. The other metrics show higher reseed noise and lower discriminability than the field assumes. The sae-probes variant of $k$-sparse probing is the most reliable metric we tested, but even sae-probes struggles to separate variants of the same SAE…
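
The three lenses named in the abstract can be made concrete with a small sketch. The Python below is an illustration under stated assumptions, not the paper's procedure: the function names, the use of a sample standard deviation for reseed noise, the pooled-noise ratio for discriminability, the choice of Spearman rank correlation for the ground-truth lens, and every number are hypothetical.

```python
import statistics


def reseed_noise(scores):
    """Sample standard deviation of one metric, re-evaluated on a
    single fixed SAE with different evaluation seeds (assumed definition)."""
    return statistics.stdev(scores)


def discriminability(scores_a, scores_b):
    """Gap between the mean scores of two SAEs, expressed in units of
    their pooled reseed noise (an assumed signal-to-noise style ratio)."""
    pooled_sd = ((statistics.variance(scores_a)
                  + statistics.variance(scores_b)) / 2) ** 0.5
    gap = abs(statistics.mean(scores_a) - statistics.mean(scores_b))
    return gap / pooled_sd


def spearman(xs, ys):
    """Spearman rank correlation between a metric and a known
    ground-truth quality score. Assumes there are no tied values."""
    def ranks(values):
        order = sorted(range(len(values)), key=lambda i: values[i])
        result = [0] * len(values)
        for rank, index in enumerate(order):
            result[index] = rank
        return result

    n = len(xs)
    d_squared = sum((a - b) ** 2 for a, b in zip(ranks(xs), ranks(ys)))
    return 1 - 6 * d_squared / (n * (n * n - 1))


# Made-up scores for illustration only: three evaluation seeds per SAE.
sae_a = [0.70, 0.71, 0.72]
sae_b = [0.71, 0.72, 0.73]

noise_a = reseed_noise(sae_a)
separation = discriminability(sae_a, sae_b)

# Made-up synthetic SAEs: known quality versus the score a metric assigns.
true_quality = [1, 2, 3, 4]
metric_score = [0.40, 0.55, 0.61, 0.78]
rank_agreement = spearman(true_quality, metric_score)
```

Under these assumed definitions, a metric is only useful when the gap between two SAEs is large relative to its reseed noise. In the toy numbers above the gap equals the noise, which is the kind of situation in which two SAEs cannot be told apart with confidence.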

Code & Models

Models

decoderesearch/sae-snapshot-panels (Hugging Face model)
