Taming Polysemanticity in LLMs: Provable Feature Recovery via Sparse Autoencoders

Siyu Chen; Heejune Sheen; Xuyuan Xiong; Tianhao Wang; Zhuoran Yang

arXiv:2506.14002 · cs.LG · June 18, 2025

Open Access

TL;DR

This paper introduces a new theoretical framework and a provably effective training algorithm for Sparse Autoencoders, enabling reliable feature recovery in Large Language Models and improving interpretability.

Contribution

It proposes a novel statistical model for polysemantic features and a bias adaptation training algorithm with proven guarantees for monosemantic feature recovery.

Findings

01 Theoretical proof of feature recovery guarantees for the proposed SAE training algorithm.

02 Empirical demonstration that GBA (Group Bias Adaptation), the paper's empirical variant of the algorithm, outperforms benchmark methods on LLMs with up to 1.5 billion parameters.

03 Enhanced interpretability of LLMs through reliable feature extraction.

Abstract

We study the challenge of achieving theoretically grounded feature recovery using Sparse Autoencoders (SAEs) for the interpretation of Large Language Models. Existing SAE training algorithms often lack rigorous mathematical guarantees and suffer from practical limitations such as hyperparameter sensitivity and instability. To address these issues, we first propose a novel statistical framework for the feature recovery problem, which includes a new notion of feature identifiability by modeling polysemantic features as sparse mixtures of underlying monosemantic concepts. Building on this framework, we introduce a new SAE training algorithm based on "bias adaptation", a technique that adaptively adjusts neural network bias parameters to ensure appropriate activation sparsity. We theoretically prove that this algorithm correctly recovers all monosemantic features when input…
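
The bias-adaptation idea in the abstract can be illustrated with a small sketch. It is not the paper's algorithm: the sign-based update rule, the fixed step size, the Gaussian stand-in data, and the names `adapt_bias` and `run_demo` are all assumptions made for illustration. It only shows how nudging a neuron's bias can steer its activation frequency toward a chosen sparsity level.

```python
import random

def adapt_bias(pre_acts, bias, target_rate, step=0.05):
    """One hypothetical bias-adaptation step for a single ReLU neuron.

    pre_acts: pre-activation values w.x for a batch of inputs (bias excluded).
    The neuron counts as active on an input when pre_act + bias > 0.
    The update rule below is an illustrative assumption, not the paper's rule.
    """
    fire_rate = sum(1 for z in pre_acts if z + bias > 0) / len(pre_acts)
    if fire_rate > target_rate:
        return bias - step  # fires too often: make activation harder
    if fire_rate < target_rate:
        return bias + step  # fires too rarely: make activation easier
    return bias

def run_demo(target_rate=0.1, n_samples=4000, n_steps=200, seed=0):
    rng = random.Random(seed)
    # Stand-in pre-activations; a real SAE would compute these from LLM activations.
    pre_acts = [rng.gauss(0.0, 1.0) for _ in range(n_samples)]
    bias = 0.0
    for _ in range(n_steps):
        bias = adapt_bias(pre_acts, bias, target_rate)
    fire_rate = sum(1 for z in pre_acts if z + bias > 0) / n_samples
    return bias, fire_rate

bias, fire_rate = run_demo()
print(f"bias={bias:.2f}, activation frequency={fire_rate:.3f}")
```

In this toy setting the bias settles at a negative value and the neuron ends up active on roughly the targeted fraction of inputs. A full SAE would interleave an adjustment of this kind with weight updates across many neurons.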

Taxonomy

Topics: Natural Language Processing Techniques · Handwritten Text Recognition Techniques · Topic Modeling