Weight-sparse transformers have interpretable circuits

Leo Gao; Achyuta Rajaram; Jacob Coxon; Soham V. Govande; Bowen Baker; Dan Mossing

arXiv:2511.13653·cs.LG·November 18, 2025

TL;DR

This paper trains transformers with most of their weights constrained to zero, which yields circuits that are easier for humans to understand, and examines the resulting trade-off between interpretability and capability.

Contribution

The authors develop a technique for training weight-sparse transformers that yield human-understandable circuits, and analyze how scaling affects interpretability and performance.

Findings

01. Making weights sparser trades capability for interpretability.

02. Scaling model size improves the capability-interpretability frontier.

03. Preliminary results suggest the method can also be applied to dense models.

Abstract

Finding human-understandable circuits in language models is a central goal of the field of mechanistic interpretability. We train models to have more understandable circuits by constraining most of their weights to be zeros, so that each neuron only has a few connections. To recover fine-grained circuits underlying each of several hand-crafted tasks, we prune the models to isolate the part responsible for the task. These circuits often contain neurons and residual channels that correspond to natural concepts, with a small number of straightforwardly interpretable connections between them. We study how these models scale and find that making weights sparser trades off capability for interpretability, and scaling model size improves the capability-interpretability frontier. However, scaling sparse models beyond tens of millions of nonzero parameters while preserving interpretability…
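To make the idea of "constraining most weights to be zeros" concrete, here is a minimal sketch in plain Python. It is not the authors' training procedure, since the abstract does not say how the constraint is enforced. It assumes a simple magnitude-based rule (keep the `k` largest-magnitude weights, zero the rest), and the function names `sparsify_top_k` and `connections_per_neuron` are hypothetical.

```python
def sparsify_top_k(weights, k):
    """Keep the k largest-magnitude entries of a 2-D weight matrix; zero the rest.

    Assumption: magnitude-based selection is used here only for illustration;
    the source does not specify the sparsification rule.
    """
    flat = [(abs(w), r, c)
            for r, row in enumerate(weights)
            for c, w in enumerate(row)]
    # Largest magnitude first; ties are broken by position so the result is deterministic.
    flat.sort(key=lambda t: (-t[0], t[1], t[2]))
    keep = {(r, c) for _, r, c in flat[:k]}
    return [[w if (r, c) in keep else 0.0 for c, w in enumerate(row)]
            for r, row in enumerate(weights)]


def connections_per_neuron(weights):
    """Count nonzero incoming weights for each output neuron (one row per neuron)."""
    return [sum(1 for w in row if w != 0.0) for row in weights]


# Toy 3x4 weight matrix: 3 output neurons, 4 inputs each.
dense = [
    [0.9, -0.1, 0.05, 0.0],
    [0.02, -0.8, 0.3, 0.01],
    [0.1, 0.2, -0.7, 0.6],
]

sparse = sparsify_top_k(dense, k=4)
print(sparse)
print(connections_per_neuron(sparse))
```

With only four nonzero weights left, each neuron in the toy matrix has one or two incoming connections, which is the property the abstract ties to more understandable circuits.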

Datasets

OzTianlu/Semigroup_Reasoning_Model_A_Scalpel
dataset · 28 downloads

Taxonomy

Topics: Explainable Artificial Intelligence (XAI) · Multimodal Machine Learning Applications · Machine Learning in Materials Science