Evaluating SAE interpretability without explanations
Gonçalo Paulo, Nora Belrose

TL;DR
This paper proposes a new method for evaluating the interpretability of sparse autoencoders that does not rely on natural language explanations, enabling more direct and standardized assessments.
Contribution
It introduces adapted interpretability metrics for sparse autoencoders that bypass explanation generation and compares these metrics with human evaluations for validation.
Findings
New interpretability metrics correlate with human judgments.
Method allows direct assessment without natural language explanations.
Provides guidelines for standardizing SAE interpretability evaluation.
Abstract
Sparse autoencoders (SAEs) and transcoders have become important tools for machine learning interpretability. However, measuring how interpretable they are remains challenging, with weak consensus about which benchmarks to use. Most evaluation procedures start by producing a single-sentence explanation for each latent. These explanations are then evaluated based on how well they enable an LLM to predict the activation of a latent in new contexts. This method makes it difficult to disentangle the explanation generation and evaluation process from the actual interpretability of the latents discovered. In this work, we adapt existing methods to assess the interpretability of sparse coders, with the advantage that they do not require generating natural language explanations as an intermediate step. This enables a more direct and potentially standardized assessment of interpretability.…
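To make the explanation-based procedure described above concrete, here is a minimal sketch of how predictions made from an explanation might be scored against a latent's real activations. It is an illustration under stated assumptions, not the authors' implementation: the function name `balanced_accuracy`, the choice of balanced accuracy as the score, and the toy data are all hypothetical, and the LLM call that would produce the predictions is replaced by a hard-coded list.

```python
from typing import Sequence


def balanced_accuracy(predicted: Sequence[bool], actual: Sequence[bool]) -> float:
    """Return the mean of the true-positive rate and the true-negative rate.

    `predicted[i]` is whether a judge (e.g. an LLM reading an explanation)
    guessed that the latent fires on context i; `actual[i]` is whether it did.
    """
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")

    positives = sum(1 for a in actual if a)
    negatives = len(actual) - positives
    if positives == 0 or negatives == 0:
        raise ValueError("need at least one active and one inactive context")

    true_pos = sum(1 for p, a in zip(predicted, actual) if p and a)
    true_neg = sum(1 for p, a in zip(predicted, actual) if not p and not a)
    return 0.5 * (true_pos / positives + true_neg / negatives)


# Hypothetical data: six held-out contexts for one latent.
actual = [True, True, False, False, False, True]      # did the latent fire?
predicted = [True, False, False, False, True, True]   # stand-in for LLM guesses

score = balanced_accuracy(predicted, actual)
print(f"balanced accuracy: {score:.3f}")
```

A score computed this way depends on both the quality of the explanation and the quality of the judge, which is the entanglement the abstract points out.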
Taxonomy
Topics: Explainable Artificial Intelligence (XAI) · Multimodal Machine Learning Applications · Adversarial Robustness in Machine Learning
