Multi-head or Single-head? An Empirical Comparison for Transformer Training

Liyuan Liu; Jialu Liu; Jiawei Han

arXiv:2106.09650·cs.CL·June 18, 2021·23 cites

TL;DR

This paper compares multi-head and single-head attention in Transformers, revealing that multi-head's main advantage is training stability, and that deeper single-head models can match or outperform multi-head models with recent training techniques.

Contribution

It demonstrates that multi-head attention's effectiveness is not solely due to attending multiple positions and shows that very deep single-head Transformers can achieve comparable performance with proper training.

Findings

01 Multi-head attention's main benefit is training stability.

02 Deep single-head Transformers can match multi-head performance.

03 Recent training methods enable stable training of very deep Transformers.

Abstract

Multi-head attention plays a crucial role in the recent success of Transformer models, leading to consistent performance improvements over conventional attention in various applications. The popular belief is that this effectiveness stems from the ability to jointly attend to multiple positions. In this paper, we first demonstrate that jointly attending to multiple positions is not a unique feature of multi-head attention, as multi-layer single-head attention also attends to multiple positions and is more effective. Then, we suggest that the main advantage of multi-head attention is training stability, since it has fewer layers than single-head attention when attending to the same number of positions. For example, a 24-layer 16-head Transformer (BERT-large) and a 384-layer single-head Transformer have the same total number of attention heads and roughly the same model size, while the…
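The head-count comparison in the abstract can be checked with a little arithmetic. The sketch below is illustrative only: `AttentionConfig` and its fields are hypothetical names introduced here, not code from the paper, and it covers only the layer and head counts stated in the abstract (it does not model parameter counts or hidden sizes).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AttentionConfig:
    """Hypothetical container for the two numbers the abstract compares."""
    num_layers: int
    heads_per_layer: int

    @property
    def total_heads(self) -> int:
        # Total attention heads across the whole stack.
        return self.num_layers * self.heads_per_layer


# Configurations as stated in the abstract.
multi_head = AttentionConfig(num_layers=24, heads_per_layer=16)   # BERT-large
single_head = AttentionConfig(num_layers=384, heads_per_layer=1)  # deep single-head

# How many times deeper the single-head stack is for the same head budget.
depth_ratio = single_head.num_layers // multi_head.num_layers

print(multi_head.total_heads, single_head.total_heads, depth_ratio)
```

Both configurations have 384 heads in total, but the single-head stack is 16 times deeper, which is the depth gap the abstract ties to training stability.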

Code & Models

Repositories

BojanFaletic/IQ_net
pytorch

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling · Text Readability and Simplification

Methods: Attention Is All You Need · Linear Layer · Absolute Position Encodings · Position-Wise Feed-Forward Layer · Byte Pair Encoding · Adam · Layer Normalization · Label Smoothing · Residual Connection · Dense Connections