SCALAR: Benchmarking SAE Interaction Sparsity in Toy LLMs
Sean P. Fillingham, Andrew Gordon, Peter Lai, Xavier Poncini, David Quarel, and Stefan Heimersheim

TL;DR
This paper introduces SCALAR, a benchmark for measuring interaction sparsity between sparse autoencoder (SAE) features in different layers of a neural network. It also proposes Staircase SAEs, which improve interaction sparsity while preserving interpretability across model architectures.
Contribution
We present SCALAR, a benchmark for interaction sparsity between SAE features. We also propose Staircase SAEs, which use weight-sharing to improve interaction sparsity while keeping features interpretable.
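The page does not say how the weight-sharing in Staircase SAEs is arranged. The sketch below illustrates one plausible reading of the name, and it is an assumption rather than the authors' design. All layers draw from one shared dictionary of n_layers * chunk latents. The SAE for layer l uses only the first (l + 1) * chunk latents, so each later layer reuses the earlier layers' features instead of learning duplicates. The class `StaircaseSAEBank`, the helper `topk_relu`, the tied initialization and the per-layer decoder bias are all hypothetical. Training is omitted.

```python
import numpy as np

def topk_relu(pre: np.ndarray, k: int) -> np.ndarray:
    """Keep the k largest pre-activations (clipped at zero); zero out the rest."""
    z = np.zeros_like(pre)
    idx = np.argpartition(pre, -k)[-k:]   # indices of the k largest entries
    z[idx] = np.maximum(pre[idx], 0.0)
    return z

class StaircaseSAEBank:
    """Hypothetical staircase layout: one shared dictionary of n_layers * chunk latents.
    The SAE for layer l uses only the first (l + 1) * chunk latents, so every latent
    available at layer l is reused, with the same weights, at all later layers."""

    def __init__(self, d_model: int, chunk: int, n_layers: int, k: int, seed: int = 0):
        assert k <= chunk, "layer 0 must have at least k latents"
        rng = np.random.default_rng(seed)
        m = chunk * n_layers
        self.W_enc = rng.normal(size=(d_model, m)) / np.sqrt(d_model)
        self.W_dec = self.W_enc.T.copy()             # tied initialization (assumption)
        self.b_enc = np.zeros(m)
        self.b_dec = np.zeros((n_layers, d_model))   # per-layer decoder bias (assumption)
        self.chunk, self.k = chunk, k

    def width(self, layer: int) -> int:
        return (layer + 1) * self.chunk

    def encode(self, x: np.ndarray, layer: int) -> np.ndarray:
        w = self.width(layer)
        # Slicing returns a view, so this layer reads the same shared weights as earlier layers.
        pre = (x - self.b_dec[layer]) @ self.W_enc[:, :w] + self.b_enc[:w]
        return topk_relu(pre, self.k)

    def decode(self, z: np.ndarray, layer: int) -> np.ndarray:
        w = self.width(layer)
        return z @ self.W_dec[:w] + self.b_dec[layer]

bank = StaircaseSAEBank(d_model=8, chunk=16, n_layers=3, k=4)
x = np.random.default_rng(1).normal(size=8)
z2 = bank.encode(x, layer=2)       # 48 latents available, at most 4 active
x_hat = bank.decode(z2, layer=2)
```

Under this assumed layout, a latent in the first chunk decodes to the same direction at every layer. An upstream feature therefore maps to one shared downstream latent rather than to several near-copies. The real method's losses, chunk sizes and bias handling may differ.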
Findings
Staircase SAEs improve interaction sparsity by up to 63% over TopK SAEs.
JSAEs improve interaction sparsity by 8.5% over TopK SAEs on feedforward layers but struggle with transformer blocks.
Staircase SAEs maintain sparsity benefits in both toy models and GPT-2 Small, preserving interpretability.
Abstract
Mechanistic interpretability aims to decompose neural networks into interpretable features and map the circuits connecting them. The standard approach trains sparse autoencoders (SAEs) on each layer's activations. However, SAEs trained in isolation do not encourage sparse cross-layer connections. This inflates extracted circuits, because upstream features needlessly affect multiple downstream features. Current evaluations focus on individual SAE performance and leave interaction sparsity unexamined. We introduce SCALAR (Sparse Connectivity Assessment of Latent Activation Relationships), a benchmark measuring interaction sparsity between SAE features. We also propose "Staircase SAEs", which use weight-sharing to limit upstream feature duplication across downstream features. Using SCALAR, we compare TopK SAEs, Jacobian SAEs (JSAEs), and Staircase SAEs. Staircase SAEs improve relative sparsity over TopK…
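The abstract does not define SCALAR's metric. The minimal sketch below shows one way to make "interaction sparsity" concrete, and SCALAR's actual metric may differ. The setup places a TopK SAE before and after a toy ReLU MLP, with random, untrained weights. The Jacobian measures how each active upstream latent moves each active downstream latent while the ReLU and TopK masks are held fixed, in the spirit of the Jacobian SAEs the paper compares against. Density is then scored as the share of entries above a relative threshold. The functions `topk_relu`, `downstream_latents`, `interaction_jacobian` and `interaction_density`, the threshold `rel_tol` and all sizes are hypothetical.

```python
import numpy as np

def topk_relu(pre: np.ndarray, k: int) -> np.ndarray:
    """Keep the k largest pre-activations (clipped at zero); zero out the rest."""
    z = np.zeros_like(pre)
    idx = np.argpartition(pre, -k)[-k:]
    z[idx] = np.maximum(pre[idx], 0.0)
    return z

# Toy setup (random, untrained): upstream SAE decoder -> ReLU MLP -> downstream SAE encoder.
rng = np.random.default_rng(0)
d, h, m, k = 16, 32, 64, 4                 # residual width, MLP width, dictionary size, TopK k
W_dec_up = rng.normal(size=(m, d)) / np.sqrt(d)
W1 = rng.normal(size=(d, h)) / np.sqrt(d)
W2 = rng.normal(size=(h, d)) / np.sqrt(h)
W_enc_dn = rng.normal(size=(d, m)) / np.sqrt(d)
b_enc_dn = np.zeros(m)

def downstream_latents(z_up: np.ndarray) -> np.ndarray:
    x = z_up @ W_dec_up                            # decode upstream latents
    y = np.maximum(x @ W1, 0.0) @ W2               # toy MLP layer
    return topk_relu(y @ W_enc_dn + b_enc_dn, k)   # encode with the downstream TopK SAE

def interaction_jacobian(z_up: np.ndarray):
    """d z_dn[b] / d z_up[a] for active a and b, with ReLU and TopK masks held fixed."""
    relu_mask = (z_up @ W_dec_up @ W1 > 0).astype(float)
    J_full = W_dec_up @ (W1 * relu_mask) @ W2 @ W_enc_dn   # shape (m, m); rows = upstream
    up = np.flatnonzero(z_up)
    dn = np.flatnonzero(downstream_latents(z_up))
    return J_full[np.ix_(up, dn)], up, dn

def interaction_density(J: np.ndarray, rel_tol: float = 1e-2) -> float:
    """Hypothetical score: share of entries above rel_tol * max|J| (lower = sparser)."""
    if J.size == 0:
        return 0.0
    return float(np.mean(np.abs(J) > rel_tol * np.abs(J).max()))

z_up = np.zeros(m)
z_up[rng.choice(m, size=k, replace=False)] = rng.uniform(0.5, 2.0, size=k)
J, up, dn = interaction_jacobian(z_up)
print(J.shape, interaction_density(J))
```

With untrained random weights, most active pairs interact, so the density will typically be high. A sparser cross-layer structure would show up as a lower score when averaged over many inputs.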
Taxonomy
Topics: Explainable Artificial Intelligence (XAI) · Advanced Graph Neural Networks · Domain Adaptation and Few-Shot Learning
