Multi-Head Self-Attention with Role-Guided Masks

Dongsheng Wang; Casper Hansen; Lucas Chaves Lima; Christian Hansen; Maria Maistro; Jakob Grue Simonsen; Christina Lioma

arXiv:2012.12366·cs.CL·December 24, 2020

TL;DR

This paper introduces role-guided masks for multi-head self-attention in Transformers, constraining attention heads to specific roles, which improves performance on text classification and translation tasks.

Contribution

It presents a novel method to guide attention heads with role-specific masks, enhancing interpretability and effectiveness of Transformer models.

Findings

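01 Outperforms competitive attention-based, CNN, and RNN baselines across 7 datasets covering text classification and machine translation

02 Improves the semantic representations learned for text tasks

03 Makes attention heads more interpretable by assigning each head a defined role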

Abstract

The state of the art in learning meaningful semantic representations of words is the Transformer model and its attention mechanisms. Simply put, the attention mechanisms learn to attend to specific parts of the input, dispensing with recurrence and convolutions. While some of the learned attention heads have been found to play linguistically interpretable roles, they can be redundant or prone to errors. We propose a method to guide the attention heads towards roles identified in prior work as important. We do this by defining role-specific masks to constrain the heads to attend to specific parts of the input, such that different heads are designed to play different roles. Experiments on text classification and machine translation using 7 different datasets show that our method outperforms competitive attention-based, CNN, and RNN baselines.
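The idea of constraining each head with its own mask can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the two roles used here (a local-window head and a separator head), the helper names `window_mask`, `separator_mask`, and `masked_multi_head_attention`, the separator list, and the random inputs are all assumptions chosen for illustration. The sketch only shows the general mechanism of adding a per-head boolean mask to scaled dot-product attention.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - np.max(x, axis=axis, keepdims=True)
    e = np.exp(x)
    return e / np.sum(e, axis=axis, keepdims=True)

def window_mask(n, width):
    # Hypothetical "relative position" role: token i may attend
    # only to tokens within `width` positions of i.
    idx = np.arange(n)
    return np.abs(idx[:, None] - idx[None, :]) <= width

def separator_mask(tokens, separators=(",", ".", ";", "!", "?")):
    # Hypothetical "separator" role: every token may attend to
    # separator tokens, plus itself so that no row is fully masked.
    is_sep = np.array([t in separators for t in tokens])
    n = len(tokens)
    return np.tile(is_sep, (n, 1)) | np.eye(n, dtype=bool)

def masked_multi_head_attention(q, k, v, masks):
    # q, k, v: (heads, n, d); masks: (heads, n, n) boolean,
    # True where attention is allowed for that head.
    d = q.shape[-1]
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d)
    scores = np.where(masks, scores, -1e9)  # block disallowed positions
    weights = softmax(scores, axis=-1)
    return weights @ v, weights

# Toy usage with random projections (assumed, not from the paper).
tokens = ["the", "cat", "sat", ",", "then", "slept", "."]
n, heads, d = len(tokens), 2, 4
rng = np.random.default_rng(0)
q = rng.normal(size=(heads, n, d))
k = rng.normal(size=(heads, n, d))
v = rng.normal(size=(heads, n, d))

# One mask per head, so each head is confined to a different role.
masks = np.stack([window_mask(n, 1), separator_mask(tokens)])
out, weights = masked_multi_head_attention(q, k, v, masks)
```

In this sketch each row of `weights` still sums to one, but all probability mass falls on positions the head's mask allows, which is what makes the head's behavior predictable by construction rather than only discoverable after training.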

Code & Models

Repositories

dswang2011/guided-attention-transformer
Official

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Text Readability and Simplification

Methods: Linear Layer · Absolute Position Encodings · Position-Wise Feed-Forward Layer · Byte Pair Encoding · Residual Connection · Dropout · Softmax · Label Smoothing · Adam · Dense Connections