Sparse Shift Autoencoders for Identifying Concepts from Large Language Model Activations

Shruti Joshi; Andrea Dittadi; Sébastien Lachapelle; Dhanya Sridhar

arXiv:2502.12179·cs.LG·March 3, 2026

Open Access

TL;DR

This paper introduces Sparse Shift Autoencoders (SSAEs), a novel method for interpretable and controllable concept extraction from large language model activations, ensuring identifiability with weak supervision.

Contribution

The paper proposes SSAEs that learn sparse differences between embeddings, providing a theoretically grounded approach for identifiable concept disentanglement in LLM interpretability.
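As a rough illustration of what learning sparse differences between embeddings could look like, the sketch below computes the shift between paired embeddings and scores a linear encoder/decoder by reconstruction error plus a sparsity penalty on the code of each shift. This is a minimal sketch under stated assumptions, not the paper's implementation: the function names, the linear maps, the L1 penalty, the coefficient, and the toy data are all hypothetical choices made for this example.

```python
import numpy as np


def embedding_shifts(z_before, z_after):
    """Difference between paired embeddings, one row per pair."""
    return np.asarray(z_after) - np.asarray(z_before)


def sparse_shift_loss(delta, w_enc, w_dec, l1_coeff=0.1):
    """Reconstruction error of the shifts plus an L1 penalty on their codes.

    Assumption: a linear encoder/decoder and an L1 penalty are used here for
    illustration only; the source does not specify the objective.
    """
    codes = delta @ w_enc                      # (n_pairs, n_latents)
    recon = codes @ w_dec                      # (n_pairs, d_model)
    recon_err = np.mean(np.sum((recon - delta) ** 2, axis=1))
    sparsity = np.mean(np.sum(np.abs(codes), axis=1))
    return recon_err + l1_coeff * sparsity, codes


# Toy data (hypothetical): each pair differs along exactly one concept direction.
rng = np.random.default_rng(0)
d_model, n_latents, n_pairs = 8, 4, 16
directions = rng.normal(size=(n_latents, d_model))
z_before = rng.normal(size=(n_pairs, d_model))
which = rng.integers(0, n_latents, size=n_pairs)
z_after = z_before + directions[which]

delta = embedding_shifts(z_before, z_after)

# An idealized encoder/decoder pair that inverts the toy concept directions.
w_dec = directions                             # (n_latents, d_model)
w_enc = np.linalg.pinv(directions)             # (d_model, n_latents)

loss, codes = sparse_shift_loss(delta, w_enc, w_dec)
active_per_pair = np.sum(np.abs(codes) > 1e-6, axis=1)
```

In this toy setup each shift activates a single latent, so the code for every pair is one-hot. Working on differences rather than raw embeddings means the parts of the embedding that stay fixed between the two inputs cancel out, leaving only what changed to be encoded.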

Findings

01 SSAEs achieve identifiable concept recovery across multiple datasets

02 SSAEs disentangle activations from different LLMs

03 Steering specific concepts with weak supervision is effective

Abstract

Unsupervised approaches to large language model (LLM) interpretability, such as sparse autoencoders (SAEs), offer a way to decode LLM activations into interpretable and, ideally, controllable concepts. On the one hand, these approaches alleviate the need for supervision from concept labels, paired prompts, or explicit causal models. On the other hand, without additional assumptions, SAEs are not guaranteed to be identifiable. In practice, they may learn latent dimensions that entangle multiple underlying concepts. If we use these dimensions to extract vectors for steering specific LLM behaviours, this non-identifiability might result in interventions that inadvertently affect unrelated properties. In this paper, we bring the question of identifiability to the forefront of LLM interpretability research. Specifically, we introduce Sparse Shift Autoencoders (SSAEs) which learn sparse…

Taxonomy

Topics: Advanced Image and Video Retrieval Techniques · Gaussian Processes and Bayesian Inference · Machine Learning and Data Classification