How does GPT-2 Predict Acronyms? Extracting and Understanding a Circuit via Mechanistic Interpretability

Jorge García-Carrasco; Alejandro Maté; Juan Trujillo

arXiv:2405.04156 · cs.LG · May 8, 2024 · 3 cites

TL;DR

This paper investigates how GPT-2 predicts acronyms by reverse-engineering its internal mechanisms, revealing a specialized circuit of attention heads that handle multi-token predictions using positional information.

Contribution

To the authors' knowledge, this is the first mechanistic interpretability analysis of a behavior that involves predicting multiple consecutive tokens. It identifies a specific circuit of attention heads in GPT-2 Small that is responsible for predicting three-letter acronyms.

Findings

01

A circuit of 8 attention heads predicts acronyms.

02

Heads are grouped into three functional categories.

03

Positional information is used via the causal mask mechanism.
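
As a rough illustration of what a causal mask does (not the paper's analysis or code), the sketch below applies a causal mask to attention scores that are identical everywhere. The function name `causal_attention_weights` and the 4-token input are hypothetical, chosen only for this example. Even with no content or positional signal in the scores, each row of weights differs by position, because position i can only spread its attention over positions 0..i.

```python
import math

def causal_attention_weights(scores):
    """Row-wise softmax over scores[i][j], with j > i masked out (causal mask)."""
    n = len(scores)
    weights = []
    for i in range(n):
        # Position i may only attend to positions 0..i.
        visible = scores[i][: i + 1]
        peak = max(visible)  # subtract the max for numerical stability
        exps = [math.exp(s - peak) for s in visible]
        total = sum(exps)
        # Masked (future) positions receive exactly zero weight.
        weights.append([e / total for e in exps] + [0.0] * (n - i - 1))
    return weights

# Hypothetical input: identical scores everywhere, so the mask is the only
# thing that distinguishes one position from another.
uniform_scores = [[0.0] * 4 for _ in range(4)]
weights = causal_attention_weights(uniform_scores)
for row in weights:
    print([round(w, 3) for w in row])
```

With uniform scores, row i places weight 1/(i+1) on each visible position, so the attention pattern alone varies with position. How the acronym circuit actually uses this kind of signal is what the paper investigates; this sketch only shows the masking step.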

Abstract

Transformer-based language models are treated as black boxes because of their large number of parameters and complex internal interactions, which is a serious safety concern. Mechanistic Interpretability (MI) intends to reverse-engineer neural network behaviors in terms of human-understandable components. In this work, we focus on understanding how GPT-2 Small performs the task of predicting three-letter acronyms. Previous works in the MI field have focused so far on tasks that predict a single token. To the best of our knowledge, this is the first work that tries to mechanistically understand a behavior involving the prediction of multiple consecutive tokens. We discover that the prediction is performed by a circuit composed of 8 attention heads (~5% of the total heads), which we classified into three groups according to their role. We also demonstrate that these heads concentrate the…
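
The "~5% of the total heads" figure can be checked with a short calculation. The sketch below assumes the public GPT-2 Small configuration of 12 layers with 12 attention heads each; those two constants come from the released model, not from this page.

```python
# GPT-2 Small architecture constants (assumed from the public model
# configuration; they are not stated on this page).
N_LAYERS = 12
N_HEADS_PER_LAYER = 12

CIRCUIT_HEADS = 8  # circuit size reported in the abstract

total_heads = N_LAYERS * N_HEADS_PER_LAYER
circuit_fraction = CIRCUIT_HEADS / total_heads

print(f"{CIRCUIT_HEADS}/{total_heads} heads = {circuit_fraction:.1%}")
```

Under that assumption the circuit covers 8 of 144 heads, which is consistent with the rounded figure in the abstract.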

Code & Models

Repositories

jgcarrasco/acronyms_paper
jax · Official

Taxonomy

Topics: Explainable Artificial Intelligence (XAI) · Topic Modeling · Neural Networks and Applications

Methods: Attention Is All You Need · Dense Connections · Weight Decay · Cosine Annealing · Attention Dropout · Dropout · Linear Warmup With Cosine Annealing · Residual Connection · Softmax