AlignSAE: Concept-Aligned Sparse Autoencoders
Minglai Yang, Xinyu Guo, Zhengliang Shi, Jinhe Bi, Steven Bethard, Mihai Surdeanu, Liangming Pan

TL;DR
AlignSAE introduces a curriculum-based method to align sparse autoencoder features with human concepts, enhancing interpretability and control over neural representations in language models.
Contribution
The paper presents a two-phase training approach, unsupervised pre-training followed by supervised post-training, that binds predefined concepts to dedicated SAE latent slots so that individual concepts can be inspected and intervened on directly.
Findings
Enables reliable concept swaps via targeted feature slots
Supports multi-hop reasoning with aligned features
Provides mechanistic insights into generalization dynamics
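To make the idea of a concept swap concrete, here is a minimal sketch, assuming each concept occupies one known index in the SAE latent vector. The function name `swap_concept`, the slot layout, and the concept names are hypothetical illustrations, not taken from the paper, and the actual intervention used by the authors may differ (for example, in how strongly the target slot is activated).

```python
def swap_concept(latents, concept_slots, source, target):
    """Move the activation from the source concept's slot to the target's slot.

    latents: list of floats, the SAE latent vector for one token.
    concept_slots: dict mapping concept name -> latent index (hypothetical layout).
    """
    edited = list(latents)  # copy so the original activations stay untouched
    src, tgt = concept_slots[source], concept_slots[target]
    edited[tgt] = edited[src]  # switch the target concept on at the same strength
    edited[src] = 0.0          # switch the source concept off
    return edited


# Hypothetical layout: the first two latents are concept slots, the rest are free.
concept_slots = {"birth_city": 0, "employer": 1}
latents = [2.5, 0.0, 0.3, 0.0, 0.7]

edited = swap_concept(latents, concept_slots, "birth_city", "employer")
print(edited)  # [0.0, 2.5, 0.3, 0.0, 0.7]
```

In a full pipeline the edited latent vector would be decoded back into the model's hidden state; that step is omitted here because it depends on the specific model and SAE.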
Abstract
Large Language Models (LLMs) encode factual knowledge within hidden parametric spaces that are difficult to inspect or control. While Sparse Autoencoders (SAEs) can decompose hidden activations into more fine-grained, interpretable features, they often struggle to reliably align these features with human-defined concepts, resulting in entangled and distributed feature representations. To address this, we introduce AlignSAE, a method that aligns SAE features with a predefined ontology through a "pre-train, then post-train" curriculum. After an initial unsupervised training phase, we apply supervised post-training to bind specific concepts to dedicated latent slots while preserving the remaining capacity for general reconstruction. This separation creates an interpretable interface where specific concepts can be inspected and controlled without interference from unrelated features.…
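The "pre-train, then post-train" curriculum can be sketched as a single loss function with an optional supervised term. This is an illustrative assumption, not the paper's stated objective: the sketch uses mean squared reconstruction error plus an L1 sparsity penalty for the unsupervised phase, and adds a cross-entropy term over the first `n_slots` latents for the supervised phase. The function name, the weights `l1_weight` and `align_weight`, and the convention that concept slots come first are all hypothetical.

```python
import math


def sae_loss(x, x_hat, z, concept_label=None, n_slots=0,
             l1_weight=1e-3, align_weight=1.0):
    """Per-example loss for a two-phase curriculum (illustrative only).

    Phase 1 (concept_label is None): reconstruction error + L1 sparsity.
    Phase 2: adds cross-entropy over the first n_slots latents, which
    rewards the labelled concept's slot for being the most active one.
    """
    recon = sum((a - b) ** 2 for a, b in zip(x, x_hat)) / len(x)
    sparsity = sum(abs(v) for v in z)
    loss = recon + l1_weight * sparsity
    if concept_label is not None:
        slots = z[:n_slots]
        m = max(slots)  # subtract the max for a numerically stable log-sum-exp
        log_norm = m + math.log(sum(math.exp(v - m) for v in slots))
        loss += align_weight * (log_norm - slots[concept_label])
    return loss


# Toy example: perfect reconstruction, two concept slots, one free latent.
x, x_hat = [1.0, 0.0], [1.0, 0.0]
z = [3.0, 0.0, 0.5]

phase1 = sae_loss(x, x_hat, z)
phase2_match = sae_loss(x, x_hat, z, concept_label=0, n_slots=2)
phase2_mismatch = sae_loss(x, x_hat, z, concept_label=1, n_slots=2)
print(phase1 < phase2_match < phase2_mismatch)  # True
```

The supervised term only touches the slot latents, so the remaining latents stay free to serve general reconstruction, which mirrors the separation the abstract describes.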
Taxonomy
Topics: Explainable Artificial Intelligence (XAI) · Generative Adversarial Networks and Image Synthesis · Domain Adaptation and Few-Shot Learning
