SynthSAEBench: Evaluating Sparse Autoencoders on Scalable Realistic Synthetic Data
David Chanin, Adrià Garriga-Alonso

TL;DR
SynthSAEBench is a comprehensive benchmark toolkit for evaluating Sparse Autoencoders on large-scale, realistic synthetic data, enabling precise diagnosis of architectural improvements and failure modes.
Contribution
We introduce SynthSAEBench, a scalable synthetic data benchmark with a standardized model for detailed SAE evaluation and analysis.
Findings
Reproduces known LLM SAE phenomena
Identifies superposition noise exploitation as a failure mode
Highlights overfitting risks with more expressive encoders
Abstract
Improving Sparse Autoencoders (SAEs) requires benchmarks that can precisely validate architectural innovations. However, current SAE benchmarks on LLMs are often too noisy to differentiate architectural improvements, and current synthetic data experiments are too small-scale and unrealistic to provide meaningful comparisons. We introduce SynthSAEBench, a toolkit for generating large-scale synthetic data with realistic feature characteristics including correlation, hierarchy, and superposition, and a standardized benchmark model, SynthSAEBench-16k, enabling direct comparison of SAE architectures. Our benchmark reproduces several previously observed LLM SAE phenomena, including the disconnect between reconstruction and latent quality metrics, poor SAE probing results, and a precision-recall trade-off mediated by L0. We further use our benchmark to identify a new failure mode: Matching…
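To make the three feature characteristics named in the abstract concrete, the following is a minimal sketch of a toy generator, not the SynthSAEBench toolkit itself. All names (`random_unit_vector`, `sample_activation`, `parent`, `base_prob`), the sizes, and the specific mechanisms for correlation and hierarchy are assumptions chosen for illustration. The abstract does not specify how the toolkit implements them.

```python
import math
import random


def random_unit_vector(dim, rng):
    """Draw a random direction on the unit sphere in `dim` dimensions."""
    v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def sample_activation(directions, parent, base_prob, rng):
    """Return (hidden_activation, ground_truth_magnitudes) for one sample."""
    num_features, dim = len(directions), len(directions[0])
    # Correlation (assumed mechanism): one shared coin flip per sample
    # scales every feature's firing probability up or down together.
    prob = base_prob * (2.0 if rng.random() < 0.5 else 0.5)
    magnitudes = [0.0] * num_features
    for i in range(num_features):
        # Hierarchy (assumed mechanism): a child may fire only if its parent
        # fired. Parents have smaller indices, so they are decided first.
        if parent[i] is not None and magnitudes[parent[i]] == 0.0:
            continue
        if rng.random() < prob:
            magnitudes[i] = rng.uniform(0.5, 1.5)
    # Superposition: active feature directions are summed in a space with
    # fewer dimensions than features, so directions cannot all be orthogonal.
    hidden = [
        sum(m * d[k] for m, d in zip(magnitudes, directions))
        for k in range(dim)
    ]
    return hidden, magnitudes


# Hypothetical toy sizes, far smaller than a 16k-feature benchmark model.
NUM_FEATURES, HIDDEN_DIM = 32, 8
rng = random.Random(0)
directions = [random_unit_vector(HIDDEN_DIM, rng) for _ in range(NUM_FEATURES)]
# Features 0-7 are roots; every later feature i has parent i % 8.
parent = [None if i < 8 else i % 8 for i in range(NUM_FEATURES)]
samples = [sample_activation(directions, parent, 0.1, rng) for _ in range(1000)]

# Ground-truth L0: average number of features active per sample.
true_l0 = sum(sum(m > 0 for m in mags) for _, mags in samples) / len(samples)
```

Because the generator returns the ground-truth magnitudes alongside each hidden activation, an SAE trained on `hidden` can be scored directly against the known features. Real LLM activations do not allow this kind of comparison.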
Taxonomy
Topics: Generative Adversarial Networks and Image Synthesis · Adversarial Robustness in Machine Learning · Domain Adaptation and Few-Shot Learning
