Are Sixteen Heads Really Better than One?

Paul Michel; Omer Levy; Graham Neubig

arXiv:1905.10650 · cs.CL · November 5, 2019 · 45 cites

TL;DR

This paper asks whether models trained with multi-headed attention actually need all of their heads. It finds that a large percentage of heads can be removed at test time without significantly impacting performance, and it examines pruning as a route to efficiency gains.

Contribution

It shows that many attention heads in models trained with multiple heads are redundant at test time, and it examines pruning strategies for removing them, challenging the assumption that more heads always improve performance.

Findings

01 A large percentage of attention heads can be removed at test time without significantly impacting performance

02 Some layers can be reduced to a single head

03 Pruning can improve speed and memory efficiency

Abstract

Attention is a powerful and ubiquitous mechanism for allowing neural models to focus on particular salient pieces of information by taking their weighted average when making predictions. In particular, multi-headed attention is a driving force behind many recent state-of-the-art NLP models such as Transformer-based MT models and BERT. These models apply multiple attention mechanisms in parallel, with each attention "head" potentially focusing on different parts of the input, which makes it possible to express sophisticated functions beyond the simple weighted average. In this paper we make the surprising observation that even if models have been trained using multiple heads, in practice, a large percentage of attention heads can be removed at test time without significantly impacting performance. In fact, some layers can even be reduced to a single head. We further examine greedy…
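The idea of removing heads at test time can be illustrated with a small sketch. The NumPy code below is not the authors' implementation: it assumes a toy self-attention layer with random weights, and the names `masked_multi_head_attention` and `head_mask` are hypothetical. Each head's output is multiplied by a 0/1 mask entry before the output projection, so setting an entry to 0 removes that head's contribution while leaving the other heads unchanged.

```python
import numpy as np


def softmax(x, axis=-1):
    # Subtract the max for numerical stability before exponentiating.
    shifted = x - x.max(axis=axis, keepdims=True)
    e = np.exp(shifted)
    return e / e.sum(axis=axis, keepdims=True)


def masked_multi_head_attention(x, w_q, w_k, w_v, w_o, head_mask):
    """Toy self-attention over x with a per-head on/off mask (illustrative only).

    x:         (seq_len, d_model)
    w_q/k/v:   (n_heads, d_model, d_head)
    w_o:       (n_heads, d_head, d_model)
    head_mask: (n_heads,), 1.0 keeps a head and 0.0 removes it
    """
    # Project the input into per-head queries, keys, and values.
    q = np.einsum("sd,hde->hse", x, w_q)
    k = np.einsum("sd,hde->hse", x, w_k)
    v = np.einsum("sd,hde->hse", x, w_v)
    # Scaled dot-product attention, computed independently for each head.
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(q.shape[-1])
    weights = softmax(scores, axis=-1)   # (n_heads, seq_len, seq_len)
    per_head = weights @ v               # (n_heads, seq_len, d_head)
    # Zero out the outputs of masked heads before combining them.
    per_head = per_head * head_mask[:, None, None]
    # Sum the per-head output projections into one (seq_len, d_model) result.
    return np.einsum("hse,hed->sd", per_head, w_o)


# Random toy weights; a real experiment would use a trained model.
rng = np.random.default_rng(0)
n_heads, seq_len, d_model, d_head = 4, 5, 8, 2
x = rng.normal(size=(seq_len, d_model))
w_q, w_k, w_v = (rng.normal(size=(n_heads, d_model, d_head)) for _ in range(3))
w_o = rng.normal(size=(n_heads, d_head, d_model))

full = masked_multi_head_attention(x, w_q, w_k, w_v, w_o, np.ones(n_heads))
mask = np.array([0.0, 1.0, 1.0, 1.0])  # remove head 0, keep the rest
pruned = masked_multi_head_attention(x, w_q, w_k, w_v, w_o, mask)
print("max change in layer output:", np.abs(full - pruned).max())
```

With random weights the printed difference says nothing about task performance; it only shows the mechanics. Measuring the effect the abstract describes would mean applying such a mask inside a trained model and comparing its test metric with and without each head.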

Taxonomy

Topics: Topic Modeling · Explainable Artificial Intelligence (XAI) · Natural Language Processing Techniques

Methods: Pruning · Linear Layer · Weight Decay · Residual Connection · Adam · Layer Normalization · Softmax · Attention Is All You Need · Dropout