Alleviating the Inequality of Attention Heads for Neural Machine Translation
Zewei Sun, Shujian Huang, Xin-Yu Dai, Jiajun Chen

TL;DR
This paper introduces HeadMask, a simple masking method to address attention head inequality in Transformer models, leading to improved translation performance across multiple language pairs.
Contribution
It proposes a novel masking technique to balance attention heads in Transformer models, enhancing neural machine translation quality.
Findings
Translation quality improved on multiple language pairs.
Empirical analysis supports the effectiveness of HeadMask.
Addresses attention head imbalance in Transformer models.
Abstract
Recent studies show that the attention heads in Transformer are not equal. We relate this phenomenon to the imbalanced training of multi-head attention and the model's dependence on specific heads. To tackle this problem, we propose a simple masking method, HeadMask, applied in two specific ways. Experiments show that translation improvements are achieved on multiple language pairs. Subsequent empirical analyses also support our assumption and confirm the effectiveness of the method.
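The abstract does not spell out the two masking variants, so the following is only a minimal sketch of one plausible form: zeroing the outputs of a randomly chosen subset of heads during a training step, so that no single head can be relied on exclusively. The function name `mask_random_heads`, the tensor layout `(batch, heads, length, dim_per_head)`, and the parameter `num_masked` are assumptions made for illustration and are not taken from the paper.

```python
import numpy as np

def mask_random_heads(head_outputs, num_masked, rng):
    """Zero out `num_masked` randomly chosen attention heads.

    head_outputs: array of shape (batch, heads, length, dim_per_head)
                  (hypothetical layout, assumed for this sketch).
    Returns the masked array and the indices of the masked heads.
    """
    num_heads = head_outputs.shape[1]
    # Pick distinct head indices to mask for this training step.
    masked = rng.choice(num_heads, size=num_masked, replace=False)
    # Build a per-head keep vector: 1 for kept heads, 0 for masked ones.
    keep = np.ones(num_heads, dtype=head_outputs.dtype)
    keep[masked] = 0.0
    # Broadcast the keep vector over batch, length, and feature axes.
    return head_outputs * keep[None, :, None, None], masked

rng = np.random.default_rng(0)
x = np.ones((2, 8, 5, 64))          # 8 heads of dummy outputs
y, masked = mask_random_heads(x, num_masked=2, rng=rng)
```

In this sketch the mask would be resampled at every training step and disabled at inference time; whether the paper follows exactly this schedule is not stated in the abstract.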
Taxonomy
Topics: Multimodal Machine Learning Applications · Topic Modeling · Natural Language Processing Techniques
Methods: Linear Layer · Absolute Position Encodings · Position-Wise Feed-Forward Layer · Dense Connections · Dropout · Layer Normalization · Byte Pair Encoding · Label Smoothing · Multi-Head Attention · Attention Is All You Need
