Language Model Circuits Are Sparse in the Neuron Basis

Aryaman Arora; Zhengxuan Wu; Jacob Steinhardt; Sarah Schwettmann

arXiv:2601.22594·cs.CL·February 2, 2026

Open Access

TL;DR

This paper shows that MLP neurons in language models are as sparse a feature basis as sparse autoencoders (SAEs). Circuits can therefore be traced directly on neurons, without extra training, and the traced circuits are causally relevant on a variety of language tasks.

Contribution

The paper shows empirically that the MLP neuron basis is as sparse as the SAE feature basis, and it builds an end-to-end pipeline that traces circuits on MLP neurons using gradient-based attribution.
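To make "gradient-based attribution" concrete, here is a minimal sketch on a toy one-hidden-layer ReLU MLP. It is not the paper's pipeline: the weights, the input, the helper names (`hidden`, `readout`, `attributions`, `top_k`, `ablate`), and the gradient-times-activation score are all assumptions chosen for illustration. Each hidden neuron is scored by its activation times the gradient of the output with respect to that activation, the top-scoring neurons are kept as the "circuit", and ablating them shows how much of the output they carry.

```python
# Toy illustration only: a one-hidden-layer ReLU MLP with hand-picked weights.
# All names and numbers are hypothetical; none are taken from the paper.

def relu(v):
    return [max(0.0, x) for x in v]

def hidden(W1, x):
    # Hidden-neuron activations h = relu(W1 @ x).
    return relu([sum(w * xi for w, xi in zip(row, x)) for row in W1])

def readout(w2, h):
    # Scalar output y = w2 . h (a stand-in for a logit difference).
    return sum(w * hi for w, hi in zip(w2, h))

def attributions(W1, w2, x):
    # Gradient-times-activation score per neuron.
    # With a linear readout, dy/dh_i = w2_i, so score_i = h_i * w2_i.
    h = hidden(W1, x)
    return [hi * wi for hi, wi in zip(h, w2)]

def top_k(scores, k):
    # Indices of the k neurons with the largest absolute score.
    order = sorted(range(len(scores)), key=lambda i: abs(scores[i]), reverse=True)
    return order[:k]

def ablate(W1, w2, x, neurons):
    # Output after zeroing the activations of the chosen neurons.
    h = hidden(W1, x)
    h = [0.0 if i in neurons else hi for i, hi in enumerate(h)]
    return readout(w2, h)

W1 = [[1.0, 0.0], [0.0, 1.0], [-1.0, 1.0], [0.5, 0.5]]
w2 = [2.0, -0.1, 3.0, 0.2]
x = [1.0, 2.0]

scores = attributions(W1, w2, x)
circuit = top_k(scores, 2)
full = readout(w2, hidden(W1, x))
without_circuit = ablate(W1, w2, x, set(circuit))

print("scores:", scores)
print("circuit neurons:", circuit)
print("output with all neurons:", full)
print("output with circuit ablated:", without_circuit)
```

In this toy case the scores sum exactly to the output because the readout is linear. In a real language model the gradient passes through later layers, so the score is only a first-order estimate of a neuron's effect.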

Findings

01 A circuit of approximately 100 MLP neurons is enough to control model behavior on a standard subject-verb agreement benchmark (Marks et al., 2025).

02 Small sets of neurons encode specific reasoning steps, such as city-to-state mapping.

03 Steering the activity of these neurons can alter model outputs.
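The steering finding can be pictured with a small sketch: rescale one hidden neuron's activation during the forward pass and check whether the predicted label changes. The network, the weights, the `forward` and `argmax` helpers, and the "singular"/"plural" labels are hypothetical and chosen only so that the effect is easy to see; this is not the paper's experimental setup.

```python
# Toy steering example: scaling one neuron's activation flips the prediction.
# Weights and labels are hypothetical, chosen for illustration only.

def relu(v):
    return [max(0.0, x) for x in v]

def forward(W1, W2, x, scale=None):
    # scale: optional dict {neuron index: multiplier} applied to hidden activations.
    h = relu([sum(w * xi for w, xi in zip(row, x)) for row in W1])
    if scale:
        h = [hi * scale.get(i, 1.0) for i, hi in enumerate(h)]
    # One logit per output label.
    return [sum(w * hi for w, hi in zip(row, h)) for row in W2]

def argmax(values):
    return max(range(len(values)), key=values.__getitem__)

LABELS = ["singular", "plural"]  # hypothetical output labels

W1 = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
W2 = [[1.0, 0.0, 0.5],   # logit for "singular"
      [0.0, 1.0, 0.5]]   # logit for "plural"
x = [2.0, 1.0]

baseline = forward(W1, W2, x)
steered = forward(W1, W2, x, scale={0: 0.0})  # clamp neuron 0 to zero

print("baseline:", baseline, "->", LABELS[argmax(baseline)])
print("steered: ", steered, "->", LABELS[argmax(steered)])
```

Here neuron 0 is the only unit that favors "singular" over "plural", so zeroing it is enough to flip the prediction. In a real model one would typically intervene on a set of neurons identified by attribution.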

Abstract

The high-level concepts that a neural network uses to perform computation need not be aligned to individual neurons (Smolensky, 1986). Language model interpretability research has thus turned to techniques such as sparse autoencoders (SAEs) to decompose the neuron basis into more interpretable units of model computation, for tasks such as circuit tracing. However, not all neuron-based representations are uninterpretable. For the first time, we empirically show that MLP neurons are as sparse a feature basis as SAEs. We use this finding to develop an end-to-end pipeline for circuit tracing on the MLP neuron basis, which locates causal circuitry on a variety of tasks using gradient-based attribution. On a standard subject-verb agreement benchmark (Marks et al., 2025), a circuit of $\approx 10^{2}$ MLP neurons is enough to control model behaviour. On the multi-hop…

Taxonomy

Topics: Explainable Artificial Intelligence (XAI) · Multimodal Machine Learning Applications · Topic Modeling