Interpretability in the Wild: a Circuit for Indirect Object Identification in GPT-2 small

Kevin Wang, Alexandre Variengien, Arthur Conmy, Buck Shlegeris, Jacob Steinhardt

arXiv:2211.00593·cs.LG·November 2, 2022·50 cites

Open Access · 5 Repos · 1 Dataset · 1 Video

TL;DR

This paper presents a detailed mechanistic explanation of how GPT-2 small performs indirect object identification, involving 26 attention heads, and demonstrates the feasibility of understanding complex behaviors in language models.

Contribution

It introduces a comprehensive, causal intervention-based interpretability analysis of GPT-2 small's behavior on a natural language task, bridging the gap between simple and complex model interpretability.

Findings

01. 26 attention heads grouped into 7 classes explain IOI behavior

02. Explanation is supported by faithfulness, completeness, and minimality criteria

03. Work demonstrates the feasibility of mechanistic understanding in language models

Abstract

Research in mechanistic interpretability seeks to explain behaviors of machine learning models in terms of their internal components. However, most previous work either focuses on simple behaviors in small models, or describes complicated behaviors in larger models with broad strokes. In this work, we bridge this gap by presenting an explanation for how GPT-2 small performs a natural language task called indirect object identification (IOI). Our explanation encompasses 26 attention heads grouped into 7 main classes, which we discovered using a combination of interpretability approaches relying on causal interventions. To our knowledge, this investigation is the largest end-to-end attempt at reverse-engineering a natural behavior "in the wild" in a language model. We evaluate the reliability of our explanation using three quantitative criteria--faithfulness, completeness and minimality.…
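To make the task in the abstract concrete: an IOI sentence introduces two names, repeats one of them as the subject, and expects the other name (the indirect object) as the next token. The sketch below builds one such prompt and scores a prediction by the difference between the logit of the indirect object and the logit of the repeated subject. The template, the helper names `make_ioi_prompt` and `logit_difference`, and the logit values are illustrative assumptions, not taken from the paper's code or data.

```python
from typing import Dict

# Hypothetical template for illustration only; this page does not list the
# templates the authors actually used.
TEMPLATE = "When {io} and {s} went to the store, {s} gave a drink to"


def make_ioi_prompt(io_name: str, s_name: str) -> str:
    """Fill the template: the subject appears twice, the indirect object once."""
    return TEMPLATE.format(io=io_name, s=s_name)


def logit_difference(next_token_logits: Dict[str, float],
                     io_name: str, s_name: str) -> float:
    """Positive when the indirect object is preferred over the repeated subject."""
    return next_token_logits[io_name] - next_token_logits[s_name]


prompt = make_ioi_prompt("Mary", "John")

# Made-up numbers standing in for a real model's next-token logits.
toy_logits = {"Mary": 3.5, "John": 0.5}
score = logit_difference(toy_logits, "Mary", "John")

print(prompt)
print(score)
```

With a real model, `toy_logits` would be replaced by the logits at the final position of the prompt. A circuit-level analysis would then compare this score between the full model and the model restricted to the proposed circuit.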

Code & Models

Repositories

Datasets

NeelNanda/counterfact-tracing
dataset · 623 downloads

Videos

Interpretability in the Wild: a Circuit for Indirect Object Identification in GPT-2 Small · SlidesLive

Taxonomy

Topics: Explainable Artificial Intelligence (XAI) · Adversarial Robustness in Machine Learning · Machine Learning in Materials Science

Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Layer Normalization · Cosine Annealing · Byte Pair Encoding · Residual Connection · Dropout · Softmax · Adam