Discovering Variable Binding Circuitry with Desiderata
Xander Davies, Max Nadeau, Nikhil Prakash, Tamar Rott Shaham, David Bau

TL;DR
This paper introduces a method to automatically identify specific model components responsible for subtasks in language models by specifying desired causal attributes, demonstrated by discovering variable binding circuitry in LLaMA-13B.
Contribution
The paper presents a novel approach extending causal mediation experiments to automatically find model components responsible for subtasks using desiderata, applied to variable binding in LLaMA-13B.
Findings
Localized variable binding to 9 attention heads and 1 MLP in LLaMA-13B
Successfully identified components responsible for arithmetic variable retrieval
Method generalizes causal mediation for automatic circuit discovery
Abstract
Recent work has shown that computation in language models may be human-understandable, with successful efforts to localize and intervene on both single-unit features and input-output circuits. Here, we introduce an approach which extends causal mediation experiments to automatically identify model components responsible for performing a specific subtask by solely specifying a set of desiderata, or causal attributes of the model components executing that subtask. As a proof of concept, we apply our method to automatically discover shared variable binding circuitry in LLaMA-13B, which retrieves variable values for multiple arithmetic tasks. Our method successfully localizes variable binding to only 9 attention heads (of the 1.6k) and one MLP in the final token's residual stream.
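The abstract does not spell out the procedure, so the following is only a toy sketch of the general idea of testing components against desiderata with activation patching, not the authors' implementation. Everything in it is an assumption made for illustration: the three-component additive "model", the component names, the form of a desideratum as an (original input, alternate input, target output) triple, and the exhaustive one-component-at-a-time scoring. A component satisfies a desideratum if swapping in its activation from the alternate input moves the output to the target.

```python
from typing import Callable, Dict, List, Optional, Tuple

Inputs = Dict[str, int]
Component = Callable[[Inputs], int]

# Hypothetical toy model: each component writes one number into a shared
# sum, and the model output is the total. None of this is LLaMA-specific.
COMPONENTS: Dict[str, Component] = {
    "head_a": lambda x: x["value"],   # carries the variable's value
    "head_b": lambda x: x["offset"],  # carries an unrelated quantity
    "mlp_0": lambda x: 0,             # ignores the input entirely
}


def run(x: Inputs, patch: Optional[Dict[str, int]] = None) -> int:
    """Run the toy model, substituting any activations listed in `patch`."""
    patch = patch or {}
    return sum(patch.get(name, fn(x)) for name, fn in COMPONENTS.items())


# A desideratum (in this sketch): original input, alternate input, and the
# output expected if ONLY the variable's value came from the alternate.
Desideratum = Tuple[Inputs, Inputs, int]
DESIDERATA: List[Desideratum] = [
    ({"value": 3, "offset": 10}, {"value": 7, "offset": 20}, 7 + 10),
    ({"value": 5, "offset": 1}, {"value": 2, "offset": 4}, 2 + 1),
]


def score(name: str, desiderata: List[Desideratum]) -> float:
    """Fraction of desiderata met when patching only component `name`."""
    hits = 0
    for original, alternate, target in desiderata:
        alt_activation = COMPONENTS[name](alternate)
        if run(original, {name: alt_activation}) == target:
            hits += 1
    return hits / len(desiderata)


scores = {name: score(name, DESIDERATA) for name in COMPONENTS}
print(scores)
```

In this toy setup only `head_a` meets every desideratum, because it is the only component whose patched activation changes the variable's value without also changing the unrelated offset. Scoring components one at a time is a simplification chosen here for readability; it would not scale to a search over roughly 1.6k attention heads and would miss components that only matter jointly.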
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Explainable Artificial Intelligence (XAI)
