Multi-Head Attention with Disagreement Regularization

Jian Li; Zhaopeng Tu; Baosong Yang; Michael R. Lyu; Tong Zhang

arXiv:1810.10183·cs.CL·October 25, 2018·22 cites

TL;DR

This paper introduces disagreement regularization for multi-head attention, which explicitly encourages the attention heads to differ from one another and improves neural machine translation performance.

Contribution

It proposes three types of disagreement regularization, which respectively encourage the subspace, the attended positions, and the output representation of each attention head to differ from those of the other heads.

Findings

01. Improved translation performance on WMT14 English-German.

02. Enhanced diversity among attention heads.

03. Demonstrated effectiveness on two language pairs: WMT14 English-German and WMT17 Chinese-English.

Abstract

Multi-head attention is appealing for the ability to jointly attend to information from different representation subspaces at different positions. In this work, we introduce a disagreement regularization to explicitly encourage the diversity among multiple attention heads. Specifically, we propose three types of disagreement regularization, which respectively encourage the subspace, the attended positions, and the output representation associated with each attention head to be different from other heads. Experimental results on widely-used WMT14 English-German and WMT17 Chinese-English translation tasks demonstrate the effectiveness and universality of the proposed approach.
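
To make the three regularization types in the abstract concrete, here is a minimal sketch for a single query position. The abstract does not give formulas, so the choices below are assumptions: disagreement between subspaces and between outputs is taken as the negative mean pairwise cosine similarity of per-head vectors, and disagreement between attended positions as the negative mean pairwise overlap (element-wise product) of per-head attention weights. The function names and the weight `lam` are hypothetical and used only for illustration.

```python
import math
from itertools import product


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def cosine_disagreement(heads):
    """Negative mean pairwise cosine similarity over all head pairs.

    `heads` holds one vector per head: either the projected value vectors
    (subspace variant) or the head outputs (output variant).
    Higher values mean more diverse heads.
    """
    h = len(heads)
    total = sum(cosine(heads[i], heads[j]) for i, j in product(range(h), repeat=2))
    return -total / (h * h)


def position_disagreement(attn):
    """Negative mean pairwise overlap of per-head attention distributions.

    `attn` holds one attention-weight vector per head; the overlap of two
    heads is the sum of the element-wise product of their weights.
    """
    h = len(attn)
    total = sum(
        sum(a * b for a, b in zip(attn[i], attn[j]))
        for i, j in product(range(h), repeat=2)
    )
    return -total / (h * h)


def regularized_loss(nll, disagreement, lam=1.0):
    """Training loss to minimize: subtracting the disagreement term rewards diversity."""
    return nll - lam * disagreement


# Two heads attending to different positions disagree more than two
# heads attending to the same position.
different = position_disagreement([[1.0, 0.0], [0.0, 1.0]])
same = position_disagreement([[1.0, 0.0], [1.0, 0.0]])
print(different > same)
```

In this sketch each pairwise sum includes a head compared with itself, which adds a constant when the vectors are normalized and does not change which configuration scores as more diverse. Any one of the three terms, or a combination of them, could be passed to `regularized_loss` as the `disagreement` argument.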

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Multimodal Machine Learning Applications