Sparse Autoencoder Features for Classifications and Transferability
Jack Gallifant, Shan Chen, Kuleen Sasse, Hugo Aerts, Thomas Hartvigsen, Danielle S. Bitterman

TL;DR
This paper explores the use of Sparse Autoencoders to extract interpretable features from Large Language Models, demonstrating their effectiveness and transferability across tasks and models, with implications for transparent AI deployment.
Contribution
It systematically analyzes SAE configurations for feature extraction from LLMs, establishing new best practices for interpretability and transferability in safety-critical applications.
Findings
SAE-derived features achieve macro F1 > 0.8, outperforming hidden-state and bag-of-words (BoW) baselines
Features transfer across models from Gemma 2 2B to 9B-IT
Binarization maintains or improves performance while simplifying feature selection
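The pooling and binarization steps named in the findings can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names `pool_activations` and `binarize`, the max/mean pooling options, and the threshold of 0 are all assumptions made for the example, since this summary does not give the paper's exact settings.

```python
from typing import List


def pool_activations(token_acts: List[List[float]], strategy: str = "max") -> List[float]:
    """Collapse a (tokens x features) activation matrix into one vector per input.

    Hypothetical helper: the pooling strategies here are illustrative choices.
    """
    if not token_acts:
        raise ValueError("need at least one token")
    width = len(token_acts[0])
    if any(len(row) != width for row in token_acts):
        raise ValueError("all token rows must have the same number of features")
    columns = list(zip(*token_acts))  # one tuple of per-token values per feature
    if strategy == "max":
        return [max(col) for col in columns]
    if strategy == "mean":
        return [sum(col) / len(col) for col in columns]
    raise ValueError(f"unknown pooling strategy: {strategy}")


def binarize(pooled: List[float], threshold: float = 0.0) -> List[int]:
    """Map each pooled activation to 1 if it exceeds the threshold, else 0.

    The default threshold of 0 is an assumption: it marks a feature as
    present if it fired at all.
    """
    return [1 if value > threshold else 0 for value in pooled]


if __name__ == "__main__":
    # Two tokens, three SAE features (made-up numbers for illustration only).
    acts = [[0.0, 1.5, 0.0],
            [0.2, 0.0, 0.0]]
    pooled = pool_activations(acts, "max")
    print(pooled, binarize(pooled))
```

The binary vector keeps only which features fired, which is what makes downstream feature selection simpler: a classifier sees presence or absence rather than activation magnitude.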
Abstract
Sparse Autoencoders (SAEs) offer potential for uncovering structured, human-interpretable representations in Large Language Models (LLMs), making them a crucial tool for transparent and controllable AI systems. We systematically analyze SAEs for interpretable feature extraction from LLMs in safety-critical classification tasks. Our framework evaluates (1) model-layer selection and scaling properties, (2) SAE architectural configurations, including width and pooling strategies, and (3) the effect of binarizing continuous SAE activations. SAE-derived features achieve macro F1 > 0.8, outperforming hidden-state and BoW baselines while demonstrating cross-model transfer from Gemma 2 2B to 9B-IT models. These features generalize in a zero-shot manner to cross-lingual toxicity detection and visual classification tasks. Our analysis highlights the significant impact of pooling strategies and…
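For readers unfamiliar with the metric quoted above, macro F1 is the unweighted mean of the per-class F1 scores, so every class counts equally regardless of how many examples it has. The sketch below computes it in plain Python; the function name `macro_f1` and the toy labels are illustrative assumptions and are not taken from the paper's evaluation code.

```python
from typing import Hashable, Sequence


def macro_f1(y_true: Sequence[Hashable], y_pred: Sequence[Hashable]) -> float:
    """Unweighted mean of per-class F1 over all labels seen in y_true or y_pred."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("need at least one example")
    labels = set(y_true) | set(y_pred)
    scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never appears.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


if __name__ == "__main__":
    # Toy labels for illustration only: class 0 has F1 = 2/3, class 1 has F1 = 0.8.
    print(macro_f1([0, 0, 1, 1], [0, 1, 1, 1]))
```

Because the average is unweighted, a score above 0.8 cannot be reached by doing well only on the majority class.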
Taxonomy
Topics: Machine Learning and Data Classification · Speech Recognition and Synthesis · Text and Document Classification Technologies
Methods: Feature Selection
