A Unified Theory of Sparse Dictionary Learning in Mechanistic Interpretability: Piecewise Biconvexity and Spurious Minima

Yiming Tang; Harshvardhan Saini; Zhaoqian Yao; Zheng Lin; Yizhen Liao; Jingyi Cui; Yisen Wang; Mengnan Du; Dianbo Liu

arXiv:2512.05534·cs.LG·May 5, 2026


TL;DR

This paper develops a unified theoretical framework for sparse dictionary learning in mechanistic interpretability, uses it to explain failure modes such as polysemantic features, feature absorption, and dead neurons, and introduces feature anchoring, a technique that improves feature recovery.

Contribution

The paper presents the first comprehensive theory that casts sparse dictionary learning (SDL) methods as a piecewise biconvex problem, characterizes their solutions and pathologies, and proposes feature anchoring to improve feature recovery.

Findings

01. Unified framework explains polysemantic features and dead neurons.

02. Introduces Linear Representation Bench for ground-truth analysis.

03. Feature anchoring significantly improves feature recovery.

Abstract

As AI models achieve remarkable capabilities across diverse domains, understanding what representations they learn and how they encode concepts has become increasingly important for both scientific progress and trustworthy deployment. Recent works in mechanistic interpretability have widely reported that neural networks represent meaningful concepts as linear directions in their representation spaces and often encode diverse concepts in superposition. Various sparse dictionary learning (SDL) methods, including sparse autoencoders, transcoders, and crosscoders, are utilized to address this by training auxiliary models with sparsity constraints to disentangle these superposed concepts into monosemantic features. These methods are the backbone of modern mechanistic interpretability, yet in practice they consistently produce polysemantic features, feature absorption, and dead neurons, with…
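To make the setup in the abstract concrete, here is a minimal sketch of one SDL method, a ReLU sparse autoencoder with an L1 sparsity penalty, together with a simple check for dead neurons. It is an illustration under stated assumptions, not the paper's formulation: the function names (`sae_forward`, `sae_loss`, `dead_latents`), the L1 penalty, the random data, and the large negative bias used to force one latent to be dead are all hypothetical choices made for this example.

```python
import numpy as np

def sae_forward(x, W_enc, b_enc, W_dec, b_dec):
    """ReLU sparse autoencoder (assumed architecture): returns (codes, reconstruction)."""
    z = np.maximum(0.0, x @ W_enc + b_enc)  # non-negative codes, ideally sparse
    x_hat = z @ W_dec + b_dec               # linear decoder: rows of W_dec act as dictionary atoms
    return z, x_hat

def sae_loss(x, z, x_hat, l1_coeff):
    """Mean squared reconstruction error plus an L1 sparsity penalty on the codes."""
    recon = np.mean(np.sum((x - x_hat) ** 2, axis=1))
    sparsity = np.mean(np.sum(np.abs(z), axis=1))
    return recon + l1_coeff * sparsity

def dead_latents(z):
    """Indices of latents that never activate on this batch (a batch-level notion of 'dead')."""
    return np.flatnonzero(np.all(z == 0.0, axis=0))

# Hypothetical toy data: 256 activations of width 8, dictionary of 32 latents.
rng = np.random.default_rng(0)
n, d, m = 256, 8, 32
x = rng.normal(size=(n, d))
W_enc = rng.normal(size=(d, m)) * 0.1
b_enc = np.zeros(m)
b_enc[0] = -100.0  # assumption for illustration: a very negative bias keeps latent 0 inactive
W_dec = rng.normal(size=(m, d)) * 0.1
b_dec = np.zeros(d)

z, x_hat = sae_forward(x, W_enc, b_enc, W_dec, b_dec)
loss = sae_loss(x, z, x_hat, l1_coeff=1e-3)
dead = dead_latents(z)
```

With the codes `z` held fixed, this loss is an ordinary least-squares problem in the decoder parameters, which gives some intuition for why an alternating, block-wise view of the optimization is natural. The precise piecewise biconvex characterization is the paper's contribution and is not reproduced here.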
