Multi-head or Single-head? An Empirical Comparison for Transformer Training
Liyuan Liu, Jialu Liu, Jiawei Han

TL;DR
This paper compares multi-head and single-head attention in Transformers. It finds that the main advantage of multi-head attention is training stability, and that deeper single-head models can match or outperform multi-head models when trained with recent techniques for stabilizing deep Transformers.
Contribution
It demonstrates that the effectiveness of multi-head attention is not solely due to attending to multiple positions, since multi-layer single-head attention does this as well, and it shows that very deep single-head Transformers can achieve comparable performance with proper training.
Findings
Multi-head attention's main benefit is training stability.
Deep single-head Transformers can match multi-head performance.
Recent training methods enable stable training of very deep Transformers.
Abstract
Multi-head attention plays a crucial role in the recent success of Transformer models, which leads to consistent performance improvements over conventional attention in various applications. The popular belief is that this effectiveness stems from the ability to jointly attend to multiple positions. In this paper, we first demonstrate that jointly attending to multiple positions is not a unique feature of multi-head attention, as multi-layer single-head attention also attends to multiple positions and is more effective. Then, we suggest that the main advantage of multi-head attention is training stability, since it needs fewer layers than single-head attention to attend to the same number of positions. For example, a 24-layer 16-head Transformer (BERT-large) and a 384-layer single-head Transformer have the same total number of attention heads and roughly the same model size, while the…
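The head-count comparison in the abstract's example is simple arithmetic, and a minimal sketch makes it explicit. The function name `total_attention_heads` is hypothetical and introduced only for illustration. The sketch counts heads only; it says nothing about parameter counts or hidden sizes, which the abstract describes only as "roughly the same model size".

```python
def total_attention_heads(num_layers: int, heads_per_layer: int) -> int:
    """Total number of attention heads across all layers of a Transformer stack.

    Hypothetical helper for illustration; it is not taken from the paper's code.
    """
    return num_layers * heads_per_layer


# 24-layer, 16-head configuration (the BERT-large example from the abstract).
multi_head_total = total_attention_heads(num_layers=24, heads_per_layer=16)

# 384-layer, single-head configuration from the same example.
single_head_total = total_attention_heads(num_layers=384, heads_per_layer=1)

print(multi_head_total, single_head_total)  # 384 384

# How many times deeper the single-head stack is for the same head count.
depth_ratio = 384 // 24
print(depth_ratio)  # 16
```

With the total head count held fixed, the single-head model is 16 times deeper, which is the depth gap the abstract connects to training stability.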
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling · Text Readability and Simplification
Methods: Attention Is All You Need · Linear Layer · Absolute Position Encodings · Position-Wise Feed-Forward Layer · Byte Pair Encoding · Adam · Layer Normalization · Label Smoothing · Residual Connection · Dense Connections
