SAGE: Scalable Ground Truth Evaluations for Large Sparse Autoencoders
Constantin Venhoff, Anisoara Calinescu, Philip Torr, Christian Schroeder de Witt

TL;DR
SAGE introduces a scalable ground truth evaluation framework for sparse autoencoders, enabling large-scale interpretability assessments without extensive prior knowledge or toy models.
Contribution
The paper presents SAGE, a novel scalable evaluation method for SAEs that reduces training overhead and applies to large models and diverse tasks.
Findings
Successfully evaluated SAEs on models such as GPT-2 Small and Pythia-70M.
Reduced training overhead with a new residual stream reconstruction method.
Demonstrated scalability and generalizability of the evaluation framework.
Abstract
A key challenge in interpretability is to decompose model activations into meaningful features. Sparse autoencoders (SAEs) have emerged as a promising tool for this task. However, a central problem in evaluating the quality of SAEs is the absence of ground truth features to serve as an evaluation gold standard. Current evaluation methods for SAEs are therefore confronted with a significant trade-off: SAEs can either leverage toy models or other proxies with predefined ground truth features; or they use extensive prior knowledge of realistic task circuits. The former limits the generalizability of the evaluation results, while the latter limits the range of models and tasks that can be used for evaluations. We introduce SAGE: Scalable Autoencoder Ground-truth Evaluation, a ground truth evaluation framework for SAEs that scales to large state-of-the-art SAEs and models. We demonstrate…
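To make the abstract's notion of "decomposing activations into features" concrete, here is a minimal sketch of a sparse autoencoder forward pass and loss, written in plain Python. It assumes the common formulation from the SAE literature (ReLU encoder, linear decoder, mean squared error plus an L1 sparsity penalty); it is not taken from the SAGE paper, and the names `sae_forward`, `sae_loss`, and `l1_coeff` are hypothetical.

```python
import random


def matvec(matrix, vector):
    """Multiply a row-major matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def sae_forward(x, w_enc, b_enc, w_dec, b_dec):
    """Encode activation x into non-negative feature codes, then decode.

    Assumed shapes: w_enc and w_dec each have one row per feature and
    one column per activation dimension.
    """
    # Encoder: f = ReLU(W_enc (x - b_dec) + b_enc)
    centered = [xi - bi for xi, bi in zip(x, b_dec)]
    pre_acts = [p + b for p, b in zip(matvec(w_enc, centered), b_enc)]
    f = [max(0.0, p) for p in pre_acts]
    # Decoder: x_hat = b_dec + sum_j f_j * (row j of w_dec)
    x_hat = list(b_dec)
    for fj, direction in zip(f, w_dec):
        for i, d in enumerate(direction):
            x_hat[i] += fj * d
    return f, x_hat


def sae_loss(x, f, x_hat, l1_coeff=1e-3):
    """Reconstruction error (MSE) plus an L1 penalty on the feature codes."""
    mse = sum((a - b) ** 2 for a, b in zip(x, x_hat)) / len(x)
    l1 = sum(abs(v) for v in f)
    return mse + l1_coeff * l1


# Toy usage with random weights: 4-dimensional activations, 8 features.
random.seed(0)
d_model, n_features = 4, 8
w_enc = [[random.gauss(0, 0.5) for _ in range(d_model)] for _ in range(n_features)]
w_dec = [[random.gauss(0, 0.5) for _ in range(d_model)] for _ in range(n_features)]
b_enc = [0.0] * n_features
b_dec = [0.0] * d_model
x = [random.gauss(0, 1) for _ in range(d_model)]

codes, reconstruction = sae_forward(x, w_enc, b_enc, w_dec, b_dec)
print(len(codes), len(reconstruction), sae_loss(x, codes, reconstruction))
```

In this formulation each decoder row is a candidate feature direction. A ground truth evaluation of the kind the abstract describes would compare such learned directions against reference features, which this sketch does not attempt.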
Peer Reviews
Decision·ICLR 2025 Conference Withdrawn Submission
Studying intrinsic features that are interpretable in large language models is an interesting and important direction. The large sparse autoencoders (SAEs) and indirect object identification (IOI) look promising. This is a relatively new direction and the proposed method and findings make a good attempt in this direction. The proposed reconstruction method seems to be sound.
This paper is difficult to read. It is mainly based on two prior works: [1] "Interpretability in the wild: a circuit for indirect object identification in GPT-2 small", Wang et al. 2023, which defines the problem of indirect object identification (IOI) and discovers a circuit for it; and [2] "Towards Principled Evaluations of Sparse Autoencoders for Interpretability and Control", Makelov et al., which gives a framework for supervised learning of sparse autoencoders (SAEs) for the indirect o
+ Interpretation of components in LLMs is a highly relevant topic.
+ SAEs are an interesting direction that has recently attracted attention.
+ Automated identification of cross-sections is a step forward. The previous work relied on prior knowledge and was thus limited in what it could be applied to. Automated discovery of cross-sections alleviates this limitation.
- Overall, I think the problem definition and the description could be better written. The authors rely heavily on the previous work, i.e., Makelov et al., for their description. However, the current article on its own is not trivial to understand.
- It is unclear how the supervised dictionaries are extracted in this work and whether this is something new. Makelov et al. already discuss the MSE estimation of the supervised dictionaries in Appendix A.6. Furthermore, in that model $u$ vectors are
The paper deals with a problem that is of interest to the ICLR community.
The main weakness of this paper is its presentation: it is very difficult to get through the manuscript and understand the introduced evaluation framework. One would expect to have a clear answer to the following three questions after reading the abstract and the intro: (1) what is done, (2) why it is done, and (3) how it is done. However, after reading the paper multiple times, the reviewer still struggles to find the answers to the above-mentioned questions. For details, please see Questio
The chosen task is important and highly relevant.
Philosophically, I don’t understand the point of this paper. If I am reading it correctly, and it’s possible I am not, the idea is to generate pseudo-ground truth features via “supervised feature dictionary learning” for cross-sections corresponding to features of interest (which are also identified automatically). Then it is claimed these features can be used to evaluate SAEs. But this seems to me to defy the whole point of SAE evaluation in the first place, which is to isolate network componen
Taxonomy
Topics: Anomaly Detection Techniques and Applications · Digital Media Forensic Detection · Adversarial Robustness in Machine Learning
Methods: Residual Connection · Attention Dropout · Attention Is All You Need · Discriminative Fine-Tuning · Linear Layer · Weight Decay · Cosine Annealing · Dropout · Byte Pair Encoding
