Copy Suppression: Comprehensively Understanding an Attention Head

Callum McDougall, Arthur Conmy, Cody Rushing, Thomas McGrath, Neel Nanda

arXiv:2310.04625 · cs.LG · October 10, 2023 · 2 cites

TL;DR

This paper investigates a single attention head in GPT-2 Small, L10H7, and shows that its main role is copy suppression: when earlier layers predict a token that already appears in the context, the head suppresses it. This improves model calibration and accounts for part of the model's self-repair behavior.

Contribution

The authors give what is, to the best of their knowledge, the most comprehensive description to date of the complete role of a single language-model component, and they uncover the mechanism by which copy suppression operates in GPT-2 Small.

Findings

01

Attention Head 10.7 suppresses naive copying behavior.

02

Copy suppression explains 76.9% of the head's impact.

03

Copy suppression explains 39% of self-repair in one setting.

Abstract

We present a single attention head in GPT-2 Small that has one main role across the entire training distribution. If components in earlier layers predict a certain token, and this token appears earlier in the context, the head suppresses it: we call this copy suppression. Attention Head 10.7 (L10H7) suppresses naive copying behavior, which improves overall model calibration. This explains why multiple prior works studying certain narrow tasks found negative heads that systematically favored the wrong answer. We uncover the mechanism that the negative heads use for copy suppression with weights-based evidence and are able to explain 76.9% of the impact of L10H7 in GPT-2 Small. To the best of our knowledge, this is the most comprehensive description of the complete role of a component in a language model to date. One major effect of copy suppression is its role in self-repair. Self-repair…
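The behavior described in the abstract can be illustrated with a toy sketch. This is not the paper's code and does not use GPT-2's weights: the function `copy_suppress`, the `strength` and `floor` parameters, the "null" attention slot, and the example logits are all hypothetical choices made for illustration. The sketch only mirrors the stated rule: tokens that earlier layers already predict, and that also appear earlier in the context, receive attention and have their logits pushed down.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def copy_suppress(logits, context, strength=2.0, floor=-5.0):
    """Toy copy-suppression step (illustrative assumption, not L10H7 itself).

    logits:  dict mapping token -> logit predicted by earlier layers
    context: list of tokens that appear earlier in the sequence
    """
    # Slot 0 is a hypothetical "null" position with score 0, so the head
    # can attend nowhere when no context token is being predicted.
    scores = [0.0] + [logits.get(tok, floor) for tok in context]
    attn = softmax(scores)[1:]  # drop the null slot

    # Push down the logit of each attended context token in proportion
    # to the attention it received; tokens outside the context are untouched.
    out = dict(logits)
    for tok, weight in zip(context, attn):
        if tok in out:
            out[tok] -= strength * weight
    return out


before = {"John": 3.0, "Mary": 1.0, "the": 0.5}
context = ["When", "John", "and", "Mary", "went"]
after = copy_suppress(before, context)
for tok in before:
    print(f"{tok}: {before[tok]:.2f} -> {after[tok]:.2f}")
```

In this toy setup the most strongly predicted context token ("John") is suppressed the most, a weakly predicted context token ("Mary") is suppressed slightly, and a token that does not appear in the context ("the") is left unchanged.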

Peer Reviews

Decision · Submitted to ICLR 2024

Reviewer 01 · Rating 3: reject, not good enough · Confidence 3

Strengths

* The paper carefully examines a particular head, including analysis and visualizations.

Weaknesses

I would like to preface the discussion here with the comment that perhaps I am not the ideal audience for this paper. But from my personal impression as someone familiar with language modeling, and also interested in model interpretability, I looked at the main contribution of the paper, "Our central claim is that at least 76.9% of the role of attention head L10H7 on GPT-2 Small's training distribution is copy suppression," and was left with the impression "I'm not sure why I care about this r

Reviewer 02 · Rating 5: marginally below the acceptance threshold · Confidence 2

Strengths

1. This paper presents an interesting hypothesis called "copy suppression": If components in earlier layers predict a certain token, and this token appears earlier in the context, the head suppresses it. The paper conducts extensive experiments to verify this hypothesis. The results show that a single head can play a complete role, which helps deepen our understanding of attention heads.
2. Copy suppression helps to understand the self-repair phenomenon, and the author conducts a quantitative an

Weaknesses

1. The conclusions given in the paper about transferability across different model classes, sizes, and data are not clear. In my opinion, this is the biggest issue with this paper. Although the author's experiments involve other models such as GPT-2 medium and Pythia besides GPT-2 small, it still does not eliminate concerns about this issue. The unclear applicability of the conclusions makes it difficult to assess the paper's contribution.
2. The presentation of this paper is not clear enough. F

Reviewer 03 · Rating 6: marginally above the acceptance threshold · Confidence 3

Strengths

1. The paper defines copy suppression as the main role of one attention head across GPT-2 Small's training distribution, then applies weights-based arguments to analyze hypotheses about copy suppression.
2. Experiments demonstrate that copy suppression explains 39% of self-repair in one setting, and that copy suppression, supported by weights-based evidence, can explain 76.9% of the impact of L10H7 in GPT-2 Small.

Weaknesses

This work only explores its findings on GPT-2 models; it would be better to verify them on more and larger models.

Code & Models

Repositories

callummcdougall/seri-mats-2023-streamlit-pages · jax · Official

Taxonomy

Topics: Adversarial Robustness in Machine Learning · Ferroelectric and Negative Capacitance Devices · Explainable Artificial Intelligence (XAI)

Methods: Linear Layer · Cosine Annealing · Discriminative Fine-Tuning · Dropout · Weight Decay · Multi-Head Attention · Softmax · Byte Pair Encoding · Linear Warmup With Cosine Annealing