TL;DR
This paper audits the SAE quality metrics in SAEBench and finds that TPP and SCR are unreliable at their canonical settings, while the remaining metrics are noisier and less discriminating than the field assumes.
Contribution
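It audits the SAE quality metrics in SAEBench through three lenses (reseed noise on a fixed SAE, ground-truth correlation on synthetic SAEs, and discriminability across training trajectories), identifies which metrics fall short, and highlights the need for better benchmarks in the field.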
Findings
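TPP and SCR fail multiple lenses at their canonical settings and should not be used to evaluate SAEs.
The other metrics show higher reseed noise and lower discriminability than the field assumes.
sae-probes is the most reliable metric tested, but it still struggles to separate variants of the same SAE.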
Abstract
Sparse autoencoders (SAEs) are a core interpretability tool for large language models, and progress on SAE architectures depends on benchmarks that reliably distinguish better SAEs from worse ones. We audit the SAE quality metrics in SAEBench, the de facto standard SAE evaluation suite, through three complementary lenses: reseed noise on a fixed SAE, ground-truth correlation on synthetic SAEs, and discriminability across training trajectories. We find that two of these metrics, Targeted Probe Perturbation (TPP) and Spurious Correlation Removal (SCR), fail multiple lenses at their canonical settings and should not be used to evaluate SAEs. The other metrics show higher reseed noise and lower discriminability than the field assumes. The sae-probes variant of k-sparse probing is the most reliable metric we tested, but even sae-probes struggles to separate variants of the same SAE…
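The abstract names the lenses but does not say how they are computed. As a rough illustration only, the sketch below shows one way to think about two of them: reseed noise as the spread of a metric when the same SAE is re-evaluated under different seeds, and discriminability as the gap between two SAEs measured in units of that spread. The function names, the pooled-noise formula, and the scores are all assumptions made for this example; they are not the paper's definitions or results.

```python
import statistics


def reseed_noise(scores):
    """Sample standard deviation of one metric across evaluation seeds on a fixed SAE.

    Hypothetical definition, chosen for illustration.
    """
    return statistics.stdev(scores)


def discriminability(scores_a, scores_b):
    """Gap between two SAEs' mean scores, in units of pooled reseed noise.

    Hypothetical signal-to-noise style measure, not the paper's formula.
    """
    pooled_var = (statistics.variance(scores_a) + statistics.variance(scores_b)) / 2
    pooled_sd = pooled_var ** 0.5
    if pooled_sd == 0:
        return float("inf")  # no reseed noise at all: any gap is resolvable
    return abs(statistics.mean(scores_a) - statistics.mean(scores_b)) / pooled_sd


# Made-up scores for illustration only; these are NOT results from the paper.
# Each list holds one metric evaluated on the same SAE under four seeds.
early_checkpoint = [0.70, 0.74, 0.68, 0.73]
late_checkpoint = [0.72, 0.75, 0.70, 0.76]

noise = reseed_noise(early_checkpoint)
ratio = discriminability(early_checkpoint, late_checkpoint)
print(f"reseed noise: {noise:.3f}, discriminability: {ratio:.2f}")
```

With these toy numbers the ratio comes out below 1, meaning the difference between the two checkpoints is smaller than the seed-to-seed spread. That is the kind of situation in which a metric cannot reliably rank the two SAEs.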
