Alleviating the Inequality of Attention Heads for Neural Machine Translation

Zewei Sun; Shujian Huang; Xin-Yu Dai; Jiajun Chen

arXiv:2009.09672 · cs.CL · September 1, 2022 · 6 cites

TL;DR

This paper introduces HeadMask, a simple masking method to address attention head inequality in Transformer models, leading to improved translation performance across multiple language pairs.

Contribution

The paper proposes HeadMask, a masking technique that balances the training of attention heads in Transformer models and improves neural machine translation quality.

Findings

01 Translation quality improved on multiple language pairs.

02 Empirical analysis supports the effectiveness of HeadMask.

03 Addresses attention head imbalance in Transformer models.

Abstract

Recent studies show that the attention heads in the Transformer are not equal. We relate this phenomenon to the imbalanced training of multi-head attention and the model's dependence on specific heads. To tackle this problem, we propose a simple masking method, HeadMask, in two specific variants. Experiments show that translation improvements are achieved on multiple language pairs. Subsequent empirical analyses also support our assumption and confirm the effectiveness of the method.
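To make the idea of masking attention heads concrete, here is a minimal NumPy sketch. The abstract does not spell out the two HeadMask variants, so this assumes one plausible form: zeroing the outputs of a randomly chosen subset of heads during a training step. The function name `mask_heads`, the parameter `num_masked`, and the tensor shapes are hypothetical choices for illustration, not the authors' implementation.

```python
import numpy as np


def mask_heads(head_outputs, num_masked, rng):
    """Zero out `num_masked` randomly chosen attention heads.

    head_outputs: array of shape (num_heads, seq_len, head_dim),
                  the per-head outputs before concatenation.
    Returns the masked outputs and the 0/1 mask over heads.
    Assumption: random selection; the paper may select heads differently.
    """
    num_heads = head_outputs.shape[0]
    # Pick distinct head indices to mask for this step.
    masked = rng.choice(num_heads, size=num_masked, replace=False)
    mask = np.ones(num_heads, dtype=head_outputs.dtype)
    mask[masked] = 0.0
    # Broadcast the per-head mask over sequence and feature dimensions.
    return head_outputs * mask[:, None, None], mask


rng = np.random.default_rng(0)
heads = rng.standard_normal((8, 5, 64))  # 8 heads, 5 tokens, 64 dims per head
masked_heads, mask = mask_heads(heads, num_masked=2, rng=rng)
```

Because the masked heads contribute nothing in that step, the remaining heads have to carry the prediction, which is one way a model could be discouraged from relying on a few specific heads.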

Taxonomy

Topics: Multimodal Machine Learning Applications · Topic Modeling · Natural Language Processing Techniques

Methods: Linear Layer · Absolute Position Encodings · Position-Wise Feed-Forward Layer · Dense Connections · Dropout · Layer Normalization · Byte Pair Encoding · Label Smoothing · Multi-Head Attention · Attention Is All You Need