Do Sparse Autoencoders Generalize? A Case Study of Answerability

Lovis Heindrich; Philip Torr; Fazl Barez; Veronika Thost

arXiv:2502.19964·cs.LG·September 8, 2025

Open Access

TL;DR

This paper investigates the generalization ability of sparse autoencoders in language model interpretability, focusing on their capacity to identify answerability across different domains and datasets.

Contribution

It provides an extensive evaluation of SAE feature transferability across diverse answerability datasets, highlighting the variability and challenges in out-of-domain generalization.

Findings

01. Residual stream probes outperform SAE features within domains.

02. SAE features show inconsistent out-of-domain transfer performance.

03. Robust evaluation methods are necessary to assess feature generalization.

Abstract

Sparse autoencoders (SAEs) have emerged as a promising approach in language model interpretability, offering unsupervised extraction of sparse features. For interpretability methods to succeed, they must identify abstract features across domains, and these features can often manifest differently in each context. We examine this through "answerability": a model's ability to recognize answerable questions. We extensively evaluate SAE feature generalization across diverse, partly self-constructed answerability datasets for Gemma 2 SAEs. Our analysis reveals that residual stream probes outperform SAE features within domains, but generalization performance differs sharply. SAE features show inconsistent out-of-domain transfer, with performance varying from almost random to outperforming residual stream probes. Overall, this demonstrates the need for robust evaluation methods and…
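The comparison described in the abstract (a linear probe on residual stream activations versus a single SAE feature, each scored in-domain and out-of-domain) can be illustrated with a toy sketch. Everything here is an assumption made for illustration: the activations are synthetic Gaussians rather than Gemma 2 activations, `toy_sae_features` is a random ReLU encoder standing in for a trained SAE, the out-of-domain set is simulated by rotating the label direction, and all function names are hypothetical. It is not the authors' evaluation pipeline.

```python
import numpy as np

rng = np.random.default_rng(0)
D, N_FEATURES = 16, 64  # toy sizes (assumption), not Gemma 2 dimensions


def make_domain(n, direction, strength=2.0):
    """Synthetic 'activations': label signal along `direction` plus unit noise."""
    y = rng.integers(0, 2, size=n)
    x = rng.normal(size=(n, D)) + strength * (2 * y[:, None] - 1) * direction
    return x, y


def fit_linear_probe(x, y, steps=500, lr=0.1):
    """Logistic-regression probe trained with plain gradient descent."""
    w, b = np.zeros(x.shape[1]), 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(x @ w + b)))
        w -= lr * x.T @ (p - y) / len(y)
        b -= lr * np.mean(p - y)
    return w, b


def probe_accuracy(w, b, x, y):
    return float(np.mean(((x @ w + b) > 0) == y))


def toy_sae_features(x, w_enc):
    """Stand-in for SAE feature activations: a random ReLU encoder."""
    return np.maximum(x @ w_enc, 0.0)


def best_single_feature(f, y):
    """Pick the feature, threshold, and polarity with the best training accuracy."""
    best = (0.0, 0, 0.0, 1)
    for j in range(f.shape[1]):
        for t in np.quantile(f[:, j], [0.25, 0.5, 0.75]):
            acc = float(np.mean((f[:, j] > t) == y))
            for polarity, a in ((1, acc), (-1, 1.0 - acc)):
                if a > best[0]:
                    best = (a, j, float(t), polarity)
    return best[1:]


def feature_accuracy(j, t, polarity, f, y):
    pred = f[:, j] > t
    if polarity == -1:
        pred = ~pred
    return float(np.mean(pred == y))


# In-domain label direction, and a rotated direction for the simulated OOD set.
dir_in = np.eye(D)[0]
dir_ood = np.cos(1.0) * np.eye(D)[0] + np.sin(1.0) * np.eye(D)[1]

x_train, y_train = make_domain(400, dir_in)
x_in, y_in = make_domain(400, dir_in)
x_ood, y_ood = make_domain(400, dir_ood)

w, b = fit_linear_probe(x_train, y_train)
w_enc = rng.normal(size=(D, N_FEATURES)) / np.sqrt(D)
j, t, polarity = best_single_feature(toy_sae_features(x_train, w_enc), y_train)

results = {
    "probe_in": probe_accuracy(w, b, x_in, y_in),
    "probe_ood": probe_accuracy(w, b, x_ood, y_ood),
    "feature_in": feature_accuracy(j, t, polarity, toy_sae_features(x_in, w_enc), y_in),
    "feature_ood": feature_accuracy(j, t, polarity, toy_sae_features(x_ood, w_enc), y_ood),
}
for name, acc in results.items():
    print(f"{name}: {acc:.3f}")
```

The numbers this prints describe only the synthetic setup and say nothing about the paper's results. The point of the sketch is the protocol: both the probe and the feature are selected on one domain, then scored unchanged on a held-out set from the same domain and on a shifted one, so that in-domain and out-of-domain accuracy can be compared side by side.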

Taxonomy

Topics: Explainable Artificial Intelligence (XAI) · Topic Modeling · Multimodal Machine Learning Applications