Sparse Autoencoder Features for Classifications and Transferability

Jack Gallifant; Shan Chen; Kuleen Sasse; Hugo Aerts; Thomas Hartvigsen; Danielle S. Bitterman

arXiv:2502.11367·cs.LG·February 3, 2026

TL;DR

This paper explores the use of Sparse Autoencoders to extract interpretable features from Large Language Models, demonstrating their effectiveness and transferability across tasks and models, with implications for transparent AI deployment.

Contribution

The paper systematically analyzes SAE configurations for extracting features from LLMs and proposes best practices for interpretability and transferability in safety-critical applications.

Findings

01. SAE-derived features outperform hidden-state and BoW baselines, with macro F1 > 0.8

02. Features transfer across models, from Gemma 2 2B to 9B-IT

03. Binarization maintains or improves performance while simplifying feature selection
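
For readers unfamiliar with the metric in the first finding, macro F1 is the unweighted mean of the per-class F1 scores, so every class counts equally regardless of its frequency. The sketch below is a plain-Python illustration of that definition, not the paper's evaluation code. The function name `macro_f1` and the toy labels are assumptions made for this example.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores (illustrative helper)."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        # F1 = 2*TP / (2*TP + FP + FN); define it as 0 when the class never appears.
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Toy labels (made up for this example, not data from the paper).
y_true = [0, 0, 1, 1]
y_pred = [0, 1, 1, 1]
score = macro_f1(y_true, y_pred)  # class 0 F1 = 2/3, class 1 F1 = 4/5
```

Because the average is unweighted, a classifier that ignores a rare class is penalized heavily, which is one reason the metric is common in imbalanced safety-related tasks.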

Abstract

Sparse Autoencoders (SAEs) have the potential to uncover structured, human-interpretable representations in Large Language Models (LLMs), making them a crucial tool for transparent and controllable AI systems. We systematically analyze SAEs for interpretable feature extraction from LLMs in safety-critical classification tasks. Our framework evaluates (1) model-layer selection and scaling properties, (2) SAE architectural configurations, including width and pooling strategies, and (3) the effect of binarizing continuous SAE activations. SAE-derived features achieve macro F1 > 0.8, outperforming hidden-state and BoW baselines while demonstrating cross-model transfer from Gemma 2 2B to 9B-IT models. These features generalize in a zero-shot manner to cross-lingual toxicity detection and visual classification tasks. Our analysis highlights the significant impact of pooling strategies and…
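
The abstract's pooling and binarization steps can be pictured with a small sketch: per-token SAE activations are collapsed into one vector per input, and the continuous values are then turned into on/off indicators. This is a minimal illustration under stated assumptions, not the authors' pipeline. The function names, the choice of max and mean pooling, and the zero threshold are all hypothetical, and the activation values are made up.

```python
def pool_activations(token_acts, strategy="max"):
    """Collapse a tokens-by-features activation matrix into one feature vector.

    `strategy` values here are illustrative; the paper's exact options may differ.
    """
    columns = list(zip(*token_acts))  # one tuple of token values per feature
    if strategy == "max":
        return [max(col) for col in columns]
    if strategy == "mean":
        return [sum(col) / len(col) for col in columns]
    raise ValueError(f"unknown pooling strategy: {strategy}")


def binarize(pooled, threshold=0.0):
    """Mark a feature as 1 if its pooled activation exceeds the threshold.

    The default threshold of 0.0 is an assumption for this sketch.
    """
    return [1 if value > threshold else 0 for value in pooled]


# Made-up activations: 3 tokens, 3 SAE features.
token_acts = [
    [0.0, 2.0, 0.0],
    [0.0, 0.5, 0.0],
    [1.5, 0.0, 0.0],
]
pooled_max = pool_activations(token_acts, "max")
binary = binarize(pooled_max)
```

The resulting binary vector could then be passed to any standard classifier; working with on/off features rather than magnitudes is what makes feature selection simpler in this setting.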

Code & Models

Repositories

shan23chen/mosaic (PyTorch, official)

Taxonomy

Topics: Machine Learning and Data Classification · Speech Recognition and Synthesis · Text and Document Classification Technologies

Methods: Feature Selection