Do Sparse Autoencoders Generalize? A Case Study of Answerability
Lovis Heindrich, Philip Torr, Fazl Barez, Veronika Thost

TL;DR
This paper investigates the generalization ability of sparse autoencoders in language model interpretability, focusing on their capacity to identify answerability across different domains and datasets.
Contribution
It provides an extensive evaluation of SAE feature transferability across diverse answerability datasets, highlighting the variability and challenges in out-of-domain generalization.
Findings
Residual stream probes outperform SAE features within domains.
SAE features show inconsistent out-of-domain transfer performance.
Robust evaluation methods are necessary to assess feature generalization.
Abstract
Sparse autoencoders (SAEs) have emerged as a promising approach in language model interpretability, offering unsupervised extraction of sparse features. For interpretability methods to succeed, they must identify abstract features across domains, and these features can often manifest differently in each context. We examine this through "answerability" - a model's ability to recognize answerable questions. We extensively evaluate SAE feature generalization across diverse, partly self-constructed answerability datasets for Gemma 2 SAEs. Our analysis reveals that residual stream probes outperform SAE features within domains, but generalization performance differs sharply. SAE features show inconsistent out-of-domain transfer, with performance varying from almost random to outperforming residual stream probes. Overall, this demonstrates the need for robust evaluation methods and…
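The in-domain versus out-of-domain comparison described in the abstract can be illustrated with a toy sketch. This is not the paper's method or data: the activations are made-up numbers standing in for a single SAE feature, the "probe" is a simple threshold on that feature, and the names `fit_threshold` and `evaluate_transfer` are hypothetical helpers introduced only for this example.

```python
from typing import Dict, Sequence, Tuple

# A dataset here is (activations of one feature, binary answerability labels).
# All values below are invented for illustration.
Dataset = Tuple[Sequence[float], Sequence[int]]


def accuracy(preds: Sequence[int], labels: Sequence[int]) -> float:
    """Fraction of predictions that match the labels."""
    return sum(int(p == y) for p, y in zip(preds, labels)) / len(labels)


def fit_threshold(acts: Sequence[float], labels: Sequence[int]) -> float:
    """Pick the cut that maximizes training accuracy (predict 1 if act > cut)."""
    values = sorted(set(acts))
    # Candidate cuts: one below the minimum, plus midpoints between neighbors.
    cuts = [values[0] - 1.0] + [(a + b) / 2 for a, b in zip(values, values[1:])]
    return max(cuts, key=lambda t: accuracy([int(a > t) for a in acts], labels))


def evaluate_transfer(train: Dataset, test_sets: Dict[str, Dataset]) -> Dict[str, float]:
    """Fit the threshold on one domain, then score it on every test domain."""
    threshold = fit_threshold(*train)
    return {
        name: accuracy([int(a > threshold) for a in acts], labels)
        for name, (acts, labels) in test_sets.items()
    }


# Toy in-domain data: the feature separates the two classes cleanly.
in_domain: Dataset = ([0.0, 0.1, 0.2, 0.9, 1.0, 1.1], [0, 0, 0, 1, 1, 1])
# Toy shifted domain: the same feature fires more weakly on positive examples.
shifted: Dataset = ([0.0, 0.1, 0.3, 0.4, 0.5, 1.0], [0, 0, 0, 1, 1, 1])

results = evaluate_transfer(in_domain, {"in_domain": in_domain, "out_of_domain": shifted})
print(results)
```

In this toy setup the threshold fitted in-domain is too high for the shifted domain, so accuracy drops there even though the feature still carries signal. It is meant only to show why a feature should be scored on held-out domains as well as on the domain it was selected on.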
Taxonomy
Topics: Explainable Artificial Intelligence (XAI) · Topic Modeling · Multimodal Machine Learning Applications
