Weight-sparse transformers have interpretable circuits
Leo Gao, Achyuta Rajaram, Jacob Coxon, Soham V. Govande, Bowen Baker, Dan Mossing

TL;DR
This paper introduces a method for training weight-sparse transformers whose circuits are highly interpretable. The resulting models make mechanisms easier to understand, and the paper examines the trade-off between interpretability and capability.
Contribution
The authors develop a technique for training weight-sparse transformers that yield human-understandable circuits, and analyze how scaling affects interpretability and performance.
Findings
Making weights sparser trades capability for interpretability.
Scaling model size improves the capability-interpretability frontier.
Preliminary results suggest the method can also be applied to dense models.
Abstract
Finding human-understandable circuits in language models is a central goal of the field of mechanistic interpretability. We train models to have more understandable circuits by constraining most of their weights to be zeros, so that each neuron only has a few connections. To recover fine-grained circuits underlying each of several hand-crafted tasks, we prune the models to isolate the part responsible for the task. These circuits often contain neurons and residual channels that correspond to natural concepts, with a small number of straightforwardly interpretable connections between them. We study how these models scale and find that making weights sparser trades off capability for interpretability, and scaling model size improves the capability-interpretability frontier. However, scaling sparse models beyond tens of millions of nonzero parameters while preserving interpretability…
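The idea of constraining most weights to zero so that each neuron keeps only a few connections can be illustrated with a minimal sketch. The block below is not the authors' training procedure: it assumes a simple top-k magnitude rule applied to a single weight matrix, and the function name `sparsify_rows`, the layer shape, and the value of `k` are hypothetical choices made for illustration only.

```python
import numpy as np

def sparsify_rows(weights: np.ndarray, k: int) -> np.ndarray:
    """Keep the k largest-magnitude entries in each row and zero the rest.

    Each row is treated as one neuron's incoming weights. Top-k by magnitude
    is an assumed rule for this sketch, not a rule taken from the paper.
    """
    if not 0 < k <= weights.shape[1]:
        raise ValueError("k must be between 1 and the number of columns")
    # Column indices of the k largest |w| in each row (unordered within the top k).
    top_idx = np.argpartition(np.abs(weights), -k, axis=1)[:, -k:]
    mask = np.zeros_like(weights, dtype=bool)
    np.put_along_axis(mask, top_idx, True, axis=1)
    return np.where(mask, weights, 0.0)

rng = np.random.default_rng(0)
W = rng.normal(size=(8, 64))        # hypothetical layer: 8 neurons, 64 inputs each
W_sparse = sparsify_rows(W, k=4)    # each neuron keeps 4 connections

# Number of surviving connections per neuron.
nonzeros_per_neuron = np.count_nonzero(W_sparse, axis=1)
```

In a training setting, a mask like this would presumably be re-applied after each optimizer step so the weights stay sparse; how the paper enforces the constraint is not stated in the excerpt above.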
Taxonomy
Topics: Explainable Artificial Intelligence (XAI) · Multimodal Machine Learning Applications · Machine Learning in Materials Science
