Language Model Circuits Are Sparse in the Neuron Basis
Aryaman Arora, Zhengxuan Wu, Jacob Steinhardt, Sarah Schwettmann

TL;DR
This paper shows that MLP neurons in language models are as sparse a feature basis as sparse autoencoder (SAE) units. This enables circuit tracing directly on the neuron basis without extra training, and it reveals causal circuits for language tasks.
Contribution
It empirically shows that MLP neurons are as sparse a feature basis as sparse autoencoders (SAEs), and it develops an end-to-end pipeline for circuit tracing on the MLP neuron basis using gradient-based attribution.
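As a rough illustration of what gradient-based attribution over neurons can look like, the sketch below scores each hidden neuron of a toy one-hidden-layer MLP by gradient times activation, then picks the smallest set of neurons that covers a chosen share of the total attribution. The toy model, the function names `neuron_attributions` and `smallest_circuit`, and the 90% coverage threshold are all assumptions made for this example; they are not taken from the paper's pipeline.

```python
import numpy as np

def neuron_attributions(x, W1, w2):
    """Gradient-times-activation score for each hidden neuron of a toy MLP.

    Toy model (an assumption, not the paper's architecture):
        h = relu(W1 @ x),  y = w2 @ h
    """
    h = np.maximum(W1 @ x, 0.0)   # hidden activations
    grad = w2                     # dy/dh for this linear readout
    return h * grad

def smallest_circuit(attr, coverage=0.9):
    """Indices of the fewest neurons whose |attribution| reaches `coverage` of the total."""
    order = np.argsort(-np.abs(attr))            # largest magnitude first
    cum = np.cumsum(np.abs(attr)[order])
    k = int(np.searchsorted(cum, coverage * cum[-1])) + 1
    return order[:k]

# Random toy weights and input, for illustration only.
rng = np.random.default_rng(0)
x = rng.normal(size=8)
W1 = rng.normal(size=(64, 8))
w2 = rng.normal(size=64)

attr = neuron_attributions(x, W1, w2)
circuit = smallest_circuit(attr)
print(f"{len(circuit)} of {attr.size} neurons cover 90% of total attribution")
```

In this toy setting the attributions sum exactly to the output, because the readout is linear in the hidden activations. In a deep model the gradient would come from automatic differentiation and the scores would only approximate each neuron's effect.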
Findings
A circuit of approximately 100 MLP neurons is enough to control model behavior on a standard subject-verb agreement benchmark (Marks et al., 2025).
Small neuron sets encode specific reasoning steps like city-to-state mapping.
Steering neuron activity can alter model outputs.
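The steering finding above can be sketched on a toy model: rescale the activation of a chosen neuron during the forward pass and compare the output with the unmodified run. Everything here is a hypothetical illustration (the toy MLP, the `forward` helper, and its `steer` argument are assumptions for this example), not the paper's intervention code.

```python
import numpy as np

def forward(x, W1, w2, steer=None):
    """Toy MLP forward pass; `steer` maps neuron index -> multiplier on its activation."""
    h = np.maximum(W1 @ x, 0.0)
    if steer:
        for idx, scale in steer.items():
            h[idx] *= scale       # scale 0.0 ablates the neuron, >1.0 amplifies it
    return float(w2 @ h)

# Random toy weights and input, for illustration only.
rng = np.random.default_rng(0)
x = rng.normal(size=8)
W1 = rng.normal(size=(64, 8))
w2 = rng.normal(size=64)

h = np.maximum(W1 @ x, 0.0)
top = int(np.argmax(np.abs(h * w2)))            # neuron with the largest contribution
baseline = forward(x, W1, w2)
ablated = forward(x, W1, w2, steer={top: 0.0})  # zero out that single neuron
print(f"baseline={baseline:.3f}  ablated={ablated:.3f}")
```

The output shifts by exactly that neuron's contribution, which is the sense in which a small set of neurons can control behavior in this toy setting.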
Abstract
The high-level concepts that a neural network uses to perform computation need not be aligned to individual neurons (Smolensky, 1986). Language model interpretability research has thus turned to techniques such as sparse autoencoders (SAEs) to decompose the neuron basis into more interpretable units of model computation, for tasks such as circuit tracing. However, not all neuron-based representations are uninterpretable. For the first time, we empirically show that MLP neurons are as sparse a feature basis as SAEs. We use this finding to develop an end-to-end pipeline for circuit tracing on the MLP neuron basis, which locates causal circuitry on a variety of tasks using gradient-based attribution. On a standard subject-verb agreement benchmark (Marks et al., 2025), a circuit of MLP neurons is enough to control model behaviour. On the multi-hop…
Taxonomy
Topics: Explainable Artificial Intelligence (XAI) · Multimodal Machine Learning Applications · Topic Modeling
