Talking-Heads Attention
Noam Shazeer, Zhenzhong Lan, Youlong Cheng, Nan Ding, Le Hou

TL;DR
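Talking-heads attention is a variation on multi-head attention that adds linear projections across the attention-heads dimension, immediately before and after the softmax. It improves masked language modeling perplexity and transfer-learning quality at the cost of a small number of additional parameters and a moderate amount of additional computation.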
Contribution
The paper introduces talking-heads attention, a multi-head attention variant in which linear projections mix information across heads just before and just after the softmax, improving model quality with few extra parameters.
Findings
Improves perplexities on masked language modeling tasks
Enhances transfer-learning quality on language comprehension and question answering tasks
Adds only a small number of parameters and a moderate amount of computation
Abstract
We introduce "talking-heads attention" - a variation on multi-head attention which includes linear projections across the attention-heads dimension, immediately before and after the softmax operation. While inserting only a small number of additional parameters and a moderate amount of additional computation, talking-heads attention leads to better perplexities on masked language modeling tasks, as well as better quality when transfer-learning to language comprehension and question answering tasks.
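The mechanism described in the abstract can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation. The function name `talking_heads_attention`, the argument names, the tensor layouts, the single unbatched example, and the 1/sqrt(d_k) scaling (borrowed from standard scaled dot-product attention) are all assumptions made for this sketch.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def talking_heads_attention(q, k, v, proj_logits, proj_weights):
    """Hypothetical sketch of talking-heads attention for one example.

    Assumed shapes (not taken from the source text):
      q:            [n, h_k, d_k]  queries
      k:            [m, h_k, d_k]  keys
      v:            [m, h_v, d_v]  values
      proj_logits:  [h_k, h]       mixes heads before the softmax
      proj_weights: [h, h_v]       mixes heads after the softmax
    Returns an array of shape [n, h_v, d_v].
    """
    d_k = q.shape[-1]
    # Per-head dot-product logits; the scaling is an assumption of this sketch.
    logits = np.einsum("nhd,mhd->nmh", q, k) / np.sqrt(d_k)
    # Linear projection across the heads dimension, before the softmax.
    logits = np.einsum("nmh,hg->nmg", logits, proj_logits)
    # Softmax over the memory positions (axis 1), separately for each head.
    weights = softmax(logits, axis=1)
    # Linear projection across the heads dimension, after the softmax.
    weights = np.einsum("nmg,gf->nmf", weights, proj_weights)
    # Weighted sum of values for each output head.
    return np.einsum("nmf,mfd->nfd", weights, v)
```

With both projections set to identity matrices, the sketch reduces to ordinary multi-head attention, which is a convenient sanity check. In this formulation the two projection matrices contribute h_k*h + h*h_v extra parameters per attention layer, which is small relative to the query, key, and value projections.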
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Multimodal Machine Learning Applications
Methods: Attention Is All You Need · Linear Layer · Talking-Heads Attention · Multi-Head Attention · Softmax
