Multi-Head Self-Attention with Role-Guided Masks
Dongsheng Wang, Casper Hansen, Lucas Chaves Lima, Christian Hansen, Maria Maistro, Jakob Grue Simonsen, Christina Lioma

TL;DR
This paper introduces role-guided masks for multi-head self-attention in Transformers, constraining attention heads to specific roles, which improves performance on text classification and translation tasks.
Contribution
It presents a novel method to guide attention heads with role-specific masks, enhancing interpretability and effectiveness of Transformer models.
Findings
Outperforms competitive attention-based, CNN, and RNN baselines on text classification and machine translation across 7 datasets
Improves semantic representations in text tasks
Enhances interpretability of attention heads
Abstract
The state of the art in learning meaningful semantic representations of words is the Transformer model and its attention mechanisms. Simply put, the attention mechanisms learn to attend to specific parts of the input, dispensing with recurrence and convolutions. While some of the learned attention heads have been found to play linguistically interpretable roles, they can be redundant or prone to errors. We propose a method to guide the attention heads towards roles identified in prior work as important. We do this by defining role-specific masks to constrain the heads to attend to specific parts of the input, such that different heads are designed to play different roles. Experiments on text classification and machine translation using 7 different datasets show that our method outperforms competitive attention-based, CNN, and RNN baselines.
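To make the idea of role-specific masks concrete, here is a minimal NumPy sketch of scaled dot-product attention in which each head receives its own boolean mask. It is an illustration under stated assumptions, not the authors' implementation: the two roles shown (`window_mask` and `separator_mask`), the separator set, the toy sentence, and the random projection matrices are all hypothetical stand-ins, since the abstract does not specify how the roles or masks are defined.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - np.max(x, axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def window_mask(n, width):
    # Hypothetical positional role: token i may attend only to
    # tokens at most `width` positions away from it.
    idx = np.arange(n)
    return np.abs(idx[:, None] - idx[None, :]) <= width

def separator_mask(tokens, separators=(".", ",", "?", "!")):
    # Hypothetical separator role: every token may attend only to
    # separator tokens, plus itself so that no row is fully masked.
    n = len(tokens)
    is_sep = np.array([t in separators for t in tokens])
    mask = np.tile(is_sep, (n, 1))
    mask |= np.eye(n, dtype=bool)
    return mask

def masked_attention(q, k, v, mask):
    # Scaled dot-product attention; disallowed positions get a large
    # negative score, so their softmax weight becomes zero.
    d = q.shape[-1]
    scores = q @ k.T / np.sqrt(d)
    scores = np.where(mask, scores, -1e9)
    weights = softmax(scores, axis=-1)
    return weights @ v, weights

tokens = ["the", "cat", "sat", ",", "then", "slept", "."]
n, d = len(tokens), 8
rng = np.random.default_rng(0)
x = rng.normal(size=(n, d))  # toy token embeddings

# One mask per head, so different heads are constrained to different roles.
masks = {"window": window_mask(n, 1), "separator": separator_mask(tokens)}

head_outputs, head_weights = [], {}
for name, mask in masks.items():
    # Per-head projections; random here, learned in a real model.
    wq, wk, wv = (rng.normal(size=(d, d)) for _ in range(3))
    out, w = masked_attention(x @ wq, x @ wk, x @ wv, mask)
    head_outputs.append(out)
    head_weights[name] = w

# Concatenate head outputs, as in standard multi-head attention.
multi_head = np.concatenate(head_outputs, axis=-1)
```

In this sketch the masks are fixed functions of the input tokens and positions, while the query, key, and value projections remain free parameters. The mask therefore decides where a head may look, and the projections still determine how attention is distributed within the allowed positions.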
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Text Readability and Simplification
Methods: Linear Layer · Absolute Position Encodings · Position-Wise Feed-Forward Layer · Byte Pair Encoding · Residual Connection · Dropout · Softmax · Label Smoothing · Adam · Dense Connections
