Talking-Heads Attention

Noam Shazeer; Zhenzhong Lan; Youlong Cheng; Nan Ding; Le Hou

arXiv:2003.02436·cs.LG·March 6, 2020·49 cites

TL;DR

Talking-heads attention is a novel variation of multi-head attention that incorporates linear projections across attention heads, improving language modeling and transfer learning performance with minimal additional complexity.

Contribution

The paper introduces talking-heads attention, a variant of multi-head attention that adds linear projections across the heads dimension and improves model quality with few extra parameters.

Findings

01 Improves perplexity on masked language modeling tasks

02 Improves transfer-learning quality on language comprehension and question answering tasks

03 Adds only a small number of parameters and a moderate amount of computation

Abstract

We introduce "talking-heads attention" - a variation on multi-head attention which includes linear projections across the attention-heads dimension, immediately before and after the softmax operation. While inserting only a small number of additional parameters and a moderate amount of additional computation, talking-heads attention leads to better perplexities on masked language modeling tasks, as well as better quality when transfer-learning to language comprehension and question answering tasks.
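The mechanism described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation. The function name `talking_heads_attention`, the tensor layout, the use of a single head count `h` for queries, keys, and values, the square `h x h` mixing matrices `p_logits` and `p_weights`, and the `1/sqrt(d)` scaling are all assumptions made for illustration.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def talking_heads_attention(q, k, v, p_logits, p_weights):
    """Illustrative talking-heads attention (assumed shapes, not from the paper page).

    q: [h, n, d]   queries per head
    k: [h, m, d]   keys per head
    v: [h, m, d]   values per head
    p_logits:  [h, h]  mixes heads immediately before the softmax
    p_weights: [h, h]  mixes heads immediately after the softmax
    """
    d = q.shape[-1]
    # Standard per-head dot-product logits (the scaling is an assumption).
    logits = np.einsum("hnd,hmd->hnm", q, k) / np.sqrt(d)
    # Linear projection across the heads dimension, before the softmax.
    logits = np.einsum("hnm,hg->gnm", logits, p_logits)
    # Softmax over the key positions, as in ordinary attention.
    weights = softmax(logits, axis=-1)
    # Linear projection across the heads dimension, after the softmax.
    weights = np.einsum("gnm,gj->jnm", weights, p_weights)
    # Weighted sum of values per head.
    return np.einsum("jnm,jmd->jnd", weights, v)


rng = np.random.default_rng(0)
h, n, m, d = 4, 3, 5, 8
q = rng.normal(size=(h, n, d))
k = rng.normal(size=(h, m, d))
v = rng.normal(size=(h, m, d))

# With identity mixing matrices the heads do not "talk", so the result
# reduces to ordinary multi-head attention.
out_identity = talking_heads_attention(q, k, v, np.eye(h), np.eye(h))

# With learned (here: random) mixing matrices, each head's attention
# pattern becomes a linear combination of all heads' patterns.
out_mixed = talking_heads_attention(
    q, k, v, rng.normal(size=(h, h)), rng.normal(size=(h, h))
)
```

Under these assumptions the only new parameters are the two `h x h` mixing matrices per attention layer, which is consistent with the abstract's claim of a small number of additional parameters.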

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Multimodal Machine Learning Applications

Methods: Attention Is All You Need · Linear Layer · Talking-Heads Attention · Multi-Head Attention · Softmax