Analyzing Multi-Head Self-Attention: Specialized Heads Do the Heavy Lifting, the Rest Can Be Pruned

Elena Voita; David Talbot; Fedor Moiseev; Rico Sennrich; Ivan Titov

arXiv:1905.09418 · cs.CL · June 10, 2019 · 70 cites

TL;DR

This paper investigates the importance of individual attention heads in the Transformer encoder for neural machine translation. It shows that most heads, particularly the less specialized ones, can be pruned with minimal performance loss, while a small set of specialized heads remains crucial for translation quality.

Contribution

The study introduces a novel pruning method, based on stochastic gates and a differentiable relaxation of the L0 penalty, that removes most encoder attention heads without significant performance degradation.

Findings

01

Specialized heads are crucial and last to be pruned.

02

Pruning 38 out of 48 encoder heads causes only a 0.15 BLEU drop on the English-Russian WMT dataset.

03

Most heads can be removed with minimal impact on translation quality.

Abstract

Multi-head self-attention is a key component of the Transformer, a state-of-the-art architecture for neural machine translation. In this work we evaluate the contribution made by individual attention heads in the encoder to the overall performance of the model and analyze the roles played by them. We find that the most important and confident heads play consistent and often linguistically-interpretable roles. When pruning heads using a method based on stochastic gates and a differentiable relaxation of the L0 penalty, we observe that specialized heads are last to be pruned. Our novel pruning method removes the vast majority of heads without seriously affecting performance. For example, on the English-Russian WMT dataset, pruning 38 out of 48 encoder heads results in a drop of only 0.15 BLEU.
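The gating idea in the abstract can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: it assumes the Hard Concrete gates of Louizos et al. (2018) as the differentiable L0 relaxation, and the hyperparameter values (BETA, GAMMA, ZETA) and all function names are illustrative assumptions. Each head gets one learnable parameter log_alpha; during training a noisy gate in [0, 1] scales the head's output, and the expected number of open gates is added to the loss as a penalty.

```python
import math
import random

# Hard Concrete hyperparameters (assumed values, not taken from this page).
BETA = 2.0 / 3.0   # temperature of the relaxed Bernoulli
GAMMA = -0.1       # lower stretch limit (must be < 0)
ZETA = 1.1         # upper stretch limit (must be > 1)


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def sample_gate(log_alpha, rng):
    """Draw one stochastic gate value in [0, 1] for a head (training time)."""
    u = rng.uniform(1e-6, 1.0 - 1e-6)
    s = sigmoid((math.log(u) - math.log(1.0 - u) + log_alpha) / BETA)
    stretched = s * (ZETA - GAMMA) + GAMMA
    # Clipping puts non-zero probability mass on exactly 0 and exactly 1.
    return min(1.0, max(0.0, stretched))


def prob_open(log_alpha):
    """Probability that the gate is non-zero; differentiable in log_alpha."""
    return sigmoid(log_alpha - BETA * math.log(-GAMMA / ZETA))


def expected_l0(log_alphas):
    """Relaxed L0 penalty: expected number of heads left switched on."""
    return sum(prob_open(a) for a in log_alphas)


def deterministic_gate(log_alpha):
    """Noise-free gate used at inference time."""
    return min(1.0, max(0.0, sigmoid(log_alpha) * (ZETA - GAMMA) + GAMMA))


def gate_heads(head_outputs, gates):
    """Scale each head's output vector by its scalar gate."""
    return [[g * x for x in head] for head, g in zip(head_outputs, gates)]


if __name__ == "__main__":
    rng = random.Random(0)
    log_alphas = [4.0, -4.0, 0.5]            # one parameter per head
    gates = [sample_gate(a, rng) for a in log_alphas]
    heads = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
    print(gate_heads(heads, gates))
    print(expected_l0(log_alphas))
```

In a full model the penalty would be weighted by a coefficient and added to the translation loss, so that gradient descent pushes log_alpha down for heads the model can do without. Heads whose deterministic gate ends at 0 can then be removed outright.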

Code & Models

Repositories

lena-voita/the-story-of-heads (TensorFlow, official)

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling · Software Engineering Research

Methods: Pruning · Linear Layer · Absolute Position Encodings · Position-Wise Feed-Forward Layer · Residual Connection · Adam · Dropout · Multi-Head Attention · Byte Pair Encoding