Sparse Shift Autoencoders for Identifying Concepts from Large Language Model Activations
Shruti Joshi, Andrea Dittadi, Sébastien Lachapelle, Dhanya Sridhar

TL;DR
This paper introduces Sparse Shift Autoencoders (SSAEs), a novel method for interpretable and controllable concept extraction from large language model activations, ensuring identifiability with weak supervision.
Contribution
The paper proposes SSAEs that learn sparse differences between embeddings, providing a theoretically grounded approach for identifiable concept disentanglement in LLM interpretability.
Findings
SSAEs achieve identifiable concept recovery across multiple datasets
SSAEs disentangle activations from different LLMs
Steering specific concepts with weak supervision is effective
Abstract
Unsupervised approaches to large language model (LLM) interpretability, such as sparse autoencoders (SAEs), offer a way to decode LLM activations into interpretable and, ideally, controllable concepts. On the one hand, these approaches alleviate the need for supervision from concept labels, paired prompts, or explicit causal models. On the other hand, without additional assumptions, SAEs are not guaranteed to be identifiable. In practice, they may learn latent dimensions that entangle multiple underlying concepts. If we use these dimensions to extract vectors for steering specific LLM behaviours, this non-identifiability might result in interventions that inadvertently affect unrelated properties. In this paper, we bring the question of identifiability to the forefront of LLM interpretability research. Specifically, we introduce Sparse Shift Autoencoders (SSAEs) which learn sparse…
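To make the idea of learning sparse differences between embeddings concrete, here is a minimal sketch of an autoencoder objective applied to the shift between two paired embeddings rather than to a single embedding. It is an illustration under stated assumptions, not the authors' implementation: the linear encoder and decoder, the L1 penalty, the function name `shift_autoencoder_loss`, and the parameter `l1_weight` are all hypothetical choices made here, and the random arrays stand in for real LLM activations.

```python
import numpy as np

def shift_autoencoder_loss(z_a, z_b, W_enc, W_dec, l1_weight=0.1):
    """Reconstruction + L1 loss on embedding differences (illustrative only)."""
    delta = z_b - z_a                 # shift between paired embeddings, shape (n, d)
    code = delta @ W_enc              # latent code for the shift, shape (n, k)
    recon = code @ W_dec              # reconstructed shift, shape (n, d)
    recon_err = np.mean(np.sum((recon - delta) ** 2, axis=1))
    sparsity = np.mean(np.sum(np.abs(code), axis=1))  # assumed L1 sparsity penalty
    return recon_err + l1_weight * sparsity, code

# Synthetic stand-ins for activations (assumption: pairs differ along one axis).
rng = np.random.default_rng(0)
n, d, k = 8, 16, 4
z_a = rng.normal(size=(n, d))
z_b = z_a.copy()
z_b[:, 0] += 1.0

W_enc = rng.normal(scale=0.1, size=(d, k))
W_dec = rng.normal(scale=0.1, size=(k, d))
loss, code = shift_autoencoder_loss(z_a, z_b, W_enc, W_dec)
```

In this sketch the model only ever sees `z_b - z_a`, so anything shared by both embeddings cancels out before encoding; the sparsity term then pushes each shift to be explained by few latent dimensions.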
Taxonomy
Topics: Advanced Image and Video Retrieval Techniques · Gaussian Processes and Bayesian Inference · Machine Learning and Data Classification
