AlignSAE: Concept-Aligned Sparse Autoencoders

Minglai Yang; Xinyu Guo; Zhengliang Shi; Jinhe Bi; Steven Bethard; Mihai Surdeanu; Liangming Pan

arXiv:2512.02004·cs.LG·January 14, 2026

Open Access

TL;DR

AlignSAE introduces a curriculum-based method to align sparse autoencoder features with human concepts, enhancing interpretability and control over neural representations in language models.

Contribution

The paper presents a two-phase training approach: unsupervised pre-training followed by supervised post-training that binds predefined concepts to dedicated latent slots, enabling targeted interventions and improved interpretability.

Findings

1. Enables reliable concept swaps via targeted feature slots
2. Supports multi-hop reasoning with aligned features
3. Provides mechanistic insights into generalization dynamics

Abstract

Large Language Models (LLMs) encode factual knowledge within hidden parametric spaces that are difficult to inspect or control. While Sparse Autoencoders (SAEs) can decompose hidden activations into more fine-grained, interpretable features, they often struggle to reliably align these features with human-defined concepts, resulting in entangled and distributed feature representations. To address this, we introduce AlignSAE, a method that aligns SAE features with a predefined ontology through a "pre-train, then post-train" curriculum. After an initial unsupervised training phase, we apply supervised post-training to bind specific concepts to dedicated latent slots while preserving the remaining capacity for general reconstruction. This separation creates an interpretable interface where specific concepts can be inspected and controlled without interference from unrelated features.…
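
The "pre-train, then post-train" curriculum described above can be illustrated with a minimal sketch. This is not the authors' implementation: the function names (`sae_loss`, `swap_concept`), the choice of placing concept slots in the first `n_concept_slots` latents, the use of a softmax cross-entropy alignment term, and the weights `l1_weight` and `align_weight` are all assumptions made for illustration. Calling `sae_loss` without a label corresponds to the unsupervised phase; passing `concept_id` adds a supervised term that pushes the labeled concept's slot to be the most active one.

```python
import math

def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def encode(h, W_enc, b_enc):
    """SAE encoder: z = ReLU(W_enc h + b_enc)."""
    pre = matvec(W_enc, h)
    return [max(0.0, p + b) for p, b in zip(pre, b_enc)]

def decode(z, W_dec, b_dec):
    """SAE decoder: h_hat = W_dec z + b_dec."""
    return [r + b for r, b in zip(matvec(W_dec, z), b_dec)]

def softmax_cross_entropy(logits, target):
    """Numerically stable cross-entropy of softmax(logits) against a class index."""
    m = max(logits)
    log_norm = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_norm - logits[target]

def sae_loss(h, params, n_concept_slots, concept_id=None,
             l1_weight=1e-3, align_weight=1.0):
    """Reconstruction + sparsity loss, plus an optional alignment term.

    Assumption: the first n_concept_slots latents are the concept slots and
    the remaining latents are free capacity for general reconstruction.
    """
    z = encode(h, params["W_enc"], params["b_enc"])
    h_hat = decode(z, params["W_dec"], params["b_dec"])
    recon = sum((a - b) ** 2 for a, b in zip(h, h_hat)) / len(h)
    sparsity = sum(abs(v) for v in z)
    loss = recon + l1_weight * sparsity
    if concept_id is not None:
        # Supervised phase (hypothetical form): treat the concept-slot
        # activations as logits and reward the slot matching the label.
        loss += align_weight * softmax_cross_entropy(z[:n_concept_slots], concept_id)
    return loss

def swap_concept(z, n_concept_slots, new_concept, strength=1.0):
    """Hypothetical intervention: clear all concept slots, activate one."""
    edited = list(z)
    for i in range(n_concept_slots):
        edited[i] = 0.0
    edited[new_concept] = strength
    return edited
```

In this sketch a concept swap only edits the concept slots and leaves the free latents untouched, which mirrors the separation the abstract describes; the edited latent vector would then be passed through `decode` to produce the modified activation.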

Taxonomy

Topics: Explainable Artificial Intelligence (XAI) · Generative Adversarial Networks and Image Synthesis · Domain Adaptation and Few-Shot Learning