Architecture, Not Scale: Circuit Localization in Large Language Models
Sohan Venkatesh

TL;DR
This paper argues that a large language model's attention architecture affects how tractable its circuits are to analyze more than its parameter count does.
Contribution
It demonstrates that grouped query attention produces more concentrated and mechanistically stable circuits than standard multi-head attention at comparable scales, challenging the assumption that larger models are inherently harder to interpret.
Findings
Grouped query attention yields more stable circuits than standard multi-head attention.
Factual recall circuits in Qwen2.5 undergo a discrete phase transition above a critical scale, collapsing to a single bottleneck rather than degrading gradually.
Architectural choices can make large models more interpretable, independent of size.
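For readers unfamiliar with the architectural difference behind the first finding, the following is a minimal NumPy sketch of grouped query attention. It is not the paper's code. The function name `grouped_query_attention`, the tensor shapes, and the causal mask are assumptions made for illustration. The one point it is meant to show is that several query heads share one key/value head, whereas standard multi-head attention gives each query head its own.

```python
import numpy as np


def softmax(x, axis=-1):
    # Subtract the row maximum for numerical stability before exponentiating.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def grouped_query_attention(q, k, v):
    """Causal attention where groups of query heads share one key/value head.

    q: array of shape (n_q_heads, seq_len, d_head)
    k, v: arrays of shape (n_kv_heads, seq_len, d_head)
    Returns (output, weights). With n_kv_heads == n_q_heads this reduces to
    standard multi-head attention.
    """
    n_q_heads, seq_len, d_head = q.shape
    n_kv_heads = k.shape[0]
    if n_q_heads % n_kv_heads != 0:
        raise ValueError("n_q_heads must be a multiple of n_kv_heads")
    group_size = n_q_heads // n_kv_heads

    # Query head h reads from key/value head h // group_size.
    k_shared = np.repeat(k, group_size, axis=0)
    v_shared = np.repeat(v, group_size, axis=0)

    scores = q @ k_shared.transpose(0, 2, 1) / np.sqrt(d_head)

    # Causal mask: position i may not attend to positions j > i.
    future = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)
    scores = np.where(future, -np.inf, scores)

    weights = softmax(scores, axis=-1)
    return weights @ v_shared, weights
```

In this sketch, calling the function with 4 query heads and 2 key/value heads means heads 0 and 1 attend over the same keys and values, as do heads 2 and 3. How this sharing relates to the circuit concentration the paper reports is the subject of the paper itself and is not modeled here.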
Abstract
Mechanistic interpretability assumes that circuit analysis becomes harder as models scale. We challenge this assumption by showing that the attention architecture matters more than parameter count. Studying three circuit types across Pythia and Qwen2.5, we find that grouped query attention produces circuits that are far more concentrated and mechanistically stable than standard multi-head attention at comparable scales. The same concentration pattern holds across indirect object identification, induction heads, and factual recall. Within a single architecture family (Qwen2.5), factual recall circuits undergo a discrete phase transition above a critical scale, collapsing to a single bottleneck rather than degrading gradually. These findings suggest that some architectural choices make large models more tractable to study and that interpretability difficulty is not a fixed consequence of…
