Interpretability in the Wild: a Circuit for Indirect Object Identification in GPT-2 small
Kevin Wang, Alexandre Variengien, Arthur Conmy, Buck Shlegeris, and Jacob Steinhardt

TL;DR
This paper presents a detailed mechanistic explanation of how GPT-2 small performs indirect object identification, involving 26 attention heads, and demonstrates the feasibility of understanding complex behaviors in language models.
Contribution
The paper gives a mechanistic account, built on causal interventions, of how GPT-2 small performs a natural language task. It bridges the gap between detailed studies of simple behaviors in small models and broad-strokes descriptions of complicated behaviors in larger models.
Findings
26 attention heads grouped into 7 classes explain IOI behavior
Explanation is supported by faithfulness, completeness, and minimality criteria
Work demonstrates the feasibility of mechanistic understanding in language models
Abstract
Research in mechanistic interpretability seeks to explain behaviors of machine learning models in terms of their internal components. However, most previous work either focuses on simple behaviors in small models, or describes complicated behaviors in larger models with broad strokes. In this work, we bridge this gap by presenting an explanation for how GPT-2 small performs a natural language task called indirect object identification (IOI). Our explanation encompasses 26 attention heads grouped into 7 main classes, which we discovered using a combination of interpretability approaches relying on causal interventions. To our knowledge, this investigation is the largest end-to-end attempt at reverse-engineering a natural behavior "in the wild" in a language model. We evaluate the reliability of our explanation using three quantitative criteria--faithfulness, completeness and minimality.…
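To make the IOI task concrete, here is a minimal Python sketch of one prompt and of a logit-difference score, which is a common way to quantify whether a model prefers the indirect object over the repeated subject. The sentence template, the names `IOIPrompt`, `make_ioi_prompt`, and `logit_difference`, and the logit values are all illustrative assumptions; they are not taken from the paper's dataset or code, and no model is run here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IOIPrompt:
    text: str             # prompt text, ending just before the answer token
    indirect_object: str  # name the model should predict next (IO)
    subject: str          # repeated name the model should not predict (S)


def make_ioi_prompt(io_name: str, s_name: str) -> IOIPrompt:
    """Build one IOI-style prompt from a hypothetical template."""
    # Assumed template: two names are introduced, then the subject repeats.
    text = (
        f"When {io_name} and {s_name} went to the store, "
        f"{s_name} gave a drink to"
    )
    return IOIPrompt(text=text, indirect_object=io_name, subject=s_name)


def logit_difference(next_token_logits: dict[str, float], prompt: IOIPrompt) -> float:
    """Logit of the indirect object minus logit of the subject.

    A positive value means the indirect object is preferred.
    """
    return (
        next_token_logits[prompt.indirect_object]
        - next_token_logits[prompt.subject]
    )


prompt = make_ioi_prompt("Mary", "John")

# Made-up next-token logits standing in for a real model's output.
toy_logits = {"Mary": 3.5, "John": 1.0}
score = logit_difference(toy_logits, prompt)
```

With a real model, `toy_logits` would be replaced by the logits the model assigns to each name's token at the final position of `prompt.text`.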
Taxonomy
Topics: Explainable Artificial Intelligence (XAI) · Adversarial Robustness in Machine Learning · Machine Learning in Materials Science
Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Layer Normalization · Cosine Annealing · Byte Pair Encoding · Residual Connection · Dropout · Softmax · Adam
