Multi-Head Attention with Disagreement Regularization
Jian Li, Zhaopeng Tu, Baosong Yang, Michael R. Lyu, Tong Zhang

TL;DR
This paper introduces disagreement regularization for multi-head attention, which explicitly encourages diversity among attention heads and improves machine translation performance.
Contribution
It proposes three types of disagreement regularization, which encourage the subspace, the attended positions, and the output representation of each attention head to differ from those of the other heads.
Findings
Improved translation accuracy on WMT14 English-German.
Enhanced diversity among attention heads.
Demonstrated effectiveness on a second language pair, WMT17 Chinese-English.
Abstract
Multi-head attention is appealing for the ability to jointly attend to information from different representation subspaces at different positions. In this work, we introduce a disagreement regularization to explicitly encourage the diversity among multiple attention heads. Specifically, we propose three types of disagreement regularization, which respectively encourage the subspace, the attended positions, and the output representation associated with each attention head to be different from other heads. Experimental results on widely-used WMT14 English-German and WMT17 Chinese-English translation tasks demonstrate the effectiveness and universality of the proposed approach.
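The abstract names the three regularizers but gives no formulas, so the following is only a minimal sketch of what such terms could look like, not the authors' definitions. It assumes cosine similarity as the similarity measure for subspaces and outputs, and the overlap of attention weights as the similarity measure for attended positions. The function names and tensor shapes are hypothetical.

```python
import numpy as np

def cosine_disagreement(x):
    """Assumed form: negative mean pairwise cosine similarity between heads.

    x has shape (H, N, D): H heads, N positions, D features per head.
    Can be applied to per-head value projections (subspace) or to
    per-head outputs (output representation). Higher means more diverse.
    """
    unit = x / (np.linalg.norm(x, axis=-1, keepdims=True) + 1e-9)
    # cos[i, j, n] = cosine similarity of head i and head j at position n
    cos = np.einsum("ind,jnd->ijn", unit, unit)
    return -cos.mean()

def position_disagreement(attn):
    """Assumed form: negative mean pairwise overlap of attention weights.

    attn has shape (H, N, M): each row is a distribution over M keys.
    """
    # overlap[i, j, n] = sum over keys of attn[i, n] * attn[j, n]
    overlap = np.einsum("inm,jnm->ijn", attn, attn)
    return -overlap.mean()
```

Under these assumptions, a term of this kind would be added to the training objective with a weighting coefficient so that higher disagreement is rewarded. Note that both functions average over all head pairs, including each head with itself, so two fully disagreeing heads score -0.5 rather than 0.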
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Multimodal Machine Learning Applications
