How does GPT-2 Predict Acronyms? Extracting and Understanding a Circuit via Mechanistic Interpretability
Jorge García-Carrasco, Alejandro Maté, Juan Trujillo

TL;DR
This paper investigates how GPT-2 predicts acronyms by reverse-engineering its internal mechanisms, revealing a specialized circuit of attention heads that handle multi-token predictions using positional information.
Contribution
It introduces the first mechanistic interpretability analysis of multi-token prediction behavior in GPT-2, identifying a specific circuit of attention heads responsible for acronym prediction.
Findings
A circuit of 8 attention heads predicts acronyms.
Heads are grouped into three functional categories.
Positional information is used via the causal mask mechanism.
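The last finding can be made concrete with a toy example. The sketch below is not the paper's circuit or its analysis code; it only illustrates, under simplifying assumptions, one way a causal mask can expose position. With all attention scores set to zero (an assumption chosen for clarity), a causally masked softmax spreads weight evenly over the visible tokens, so the weight placed on the first token at position i is 1/(i+1), which depends only on position. The function name and toy sizes are hypothetical.

```python
import math

def causal_attention_weights(scores):
    """Row-wise softmax where position i may only attend to positions 0..i."""
    n = len(scores)
    weights = []
    for i in range(n):
        visible = scores[i][: i + 1]           # causal mask: drop future positions
        m = max(visible)                       # subtract max for numerical stability
        exps = [math.exp(s - m) for s in visible]
        total = sum(exps)
        row = [e / total for e in exps] + [0.0] * (n - i - 1)
        weights.append(row)
    return weights

n = 5
# Assumption for illustration: identical (zero) scores everywhere,
# so no content or positional embedding information is used.
uniform_scores = [[0.0] * n for _ in range(n)]
weights = causal_attention_weights(uniform_scores)

# Weight that each position places on the first token.
bos_weight = [weights[i][0] for i in range(n)]
for i, w in enumerate(bos_weight):
    print(f"position {i}: weight on first token = {w:.3f}")
```

Because the weight on the first token shrinks as 1/(i+1), a later layer could in principle read off how far along the sequence a token is, even though nothing in the scores encoded position explicitly.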
Abstract
Transformer-based language models are treated as black-boxes because of their large number of parameters and complex internal interactions, which is a serious safety concern. Mechanistic Interpretability (MI) intends to reverse-engineer neural network behaviors in terms of human-understandable components. In this work, we focus on understanding how GPT-2 Small performs the task of predicting three-letter acronyms. Previous works in the MI field have focused so far on tasks that predict a single token. To the best of our knowledge, this is the first work that tries to mechanistically understand a behavior involving the prediction of multiple consecutive tokens. We discover that the prediction is performed by a circuit composed of 8 attention heads (~5% of the total heads) which we classified in three groups according to their role. We also demonstrate that these heads concentrate the…
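The "~5% of the total heads" figure in the abstract can be checked with a short calculation. The sketch below assumes the standard GPT-2 Small configuration of 12 layers with 12 attention heads each; those two numbers are not stated in the text above, and the constant names are hypothetical.

```python
# Assumed GPT-2 Small configuration (not stated in the abstract itself).
N_LAYERS = 12
HEADS_PER_LAYER = 12

# Circuit size reported in the abstract.
CIRCUIT_HEADS = 8

total_heads = N_LAYERS * HEADS_PER_LAYER
circuit_fraction = CIRCUIT_HEADS / total_heads

print(f"{CIRCUIT_HEADS}/{total_heads} heads = {circuit_fraction:.1%}")
```

Under that assumption the circuit covers 8 of 144 heads, roughly 5.6%, which is consistent with the approximate figure quoted in the abstract.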
Taxonomy
Topics: Explainable Artificial Intelligence (XAI) · Topic Modeling · Neural Networks and Applications
Methods: Attention Is All You Need · Dense Connections · Weight Decay · Cosine Annealing · Attention Dropout · Dropout · Linear Warmup With Cosine Annealing · Residual Connection · Softmax
