Multi-Head Attention: Collaborate Instead of Concatenate

Jean-Baptiste Cordonnier; Andreas Loukas; Martin Jaggi

arXiv:2006.16362 · cs.LG · May 21, 2021 · 76 cites

TL;DR

This paper introduces a collaborative multi-head attention mechanism that shares key/query projections among heads, reducing parameters and maintaining performance across NLP and vision tasks.

Contribution

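The paper proposes a collaborative multi-head attention layer in which heads share key/query projections. The layer has fewer parameters than standard multi-head attention and can be used as a drop-in replacement in existing transformer architectures.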

Findings

01. Sharing key/query projections maintains accuracy while reducing parameters.

02. The method is effective in language understanding, translation, and vision tasks.

03. Pre-trained models can be re-parametrized into the collaborative attention form.
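
The third finding can be illustrated in its simplest, lossless form. The sketch below is not the authors' code, and it does not compress anything. It assumes a mixing-vector formulation in which every head scores with the same shared projections and a per-head vector that reweights their columns. Under that assumption, standard concatenated heads are a special case: stack the per-head projections side by side and give each head a block one-hot mixing vector. All function names here are hypothetical.

```python
import numpy as np

def concat_scores(X, Y, Wq_heads, Wk_heads):
    """Unnormalized scores of standard multi-head attention, one matrix per head."""
    return np.stack([(X @ Wq) @ (Y @ Wk).T for Wq, Wk in zip(Wq_heads, Wk_heads)])

def collaborative_scores(X, Y, Wq, Wk, M):
    """Scores X Wq diag(m_i) Wk^T Y^T for each mixing vector m_i (assumed form)."""
    Q = X @ Wq  # shared query projection, shape (T_x, D_shared)
    K = Y @ Wk  # shared key projection, shape (T_y, D_shared)
    # Multiplying the columns of Q by m_i is the same as inserting diag(m_i).
    return np.stack([(Q * m) @ K.T for m in M])

def reparametrize(Wq_heads, Wk_heads):
    """Rewrite per-head projections as shared ones plus block one-hot mixing."""
    n_heads = len(Wq_heads)
    d_k = Wq_heads[0].shape[1]
    Wq = np.concatenate(Wq_heads, axis=1)  # (D_in, n_heads * d_k)
    Wk = np.concatenate(Wk_heads, axis=1)
    # Row i is 1 on head i's own block of d_k columns and 0 elsewhere.
    M = np.kron(np.eye(n_heads), np.ones((1, d_k)))
    return Wq, Wk, M
```

With this choice of mixing vectors the two score functions agree exactly. Any parameter saving would come from shrinking the shared dimension below `n_heads * d_k`, which this sketch does not attempt.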

Abstract

Attention layers are widely used in natural language processing (NLP) and are beginning to influence computer vision architectures. Training very large transformer models allowed significant improvement in both fields, but once trained, these networks show symptoms of over-parameterization. For instance, it is known that many attention heads can be pruned without impacting accuracy. This work aims to enhance current understanding on how multiple heads interact. Motivated by the observation that attention heads learn redundant key/query projections, we propose a collaborative multi-head attention layer that enables heads to learn shared projections. Our scheme decreases the number of parameters in an attention layer and can be used as a drop-in replacement in any transformer architecture. Our experiments confirm that sharing key/query dimensions can be exploited in language…
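
The abstract does not spell out the layer itself, so the following is a minimal NumPy sketch under stated assumptions rather than the authors' implementation. It assumes that all heads share one query and one key projection, that each head has a mixing vector over the shared dimensions, that values stay per-head, and that biases are omitted. The `scale` argument, the function names, and the parameter-count helper are hypothetical choices made for illustration.

```python
import numpy as np

def softmax(z, axis=-1):
    """Numerically stable softmax along the given axis."""
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def collaborative_attention(X, Wq, Wk, M, Wv_heads, Wo, scale=1.0):
    """Self-attention with shared key/query projections (assumed form).

    X: (T, D_in); Wq, Wk: (D_in, D_shared); M: (n_heads, D_shared);
    Wv_heads: list of (D_in, d_v); Wo: (n_heads * d_v, D_out).
    """
    Q = X @ Wq  # computed once and reused by every head
    K = X @ Wk
    head_outputs = []
    for m, Wv in zip(M, Wv_heads):
        # Head-specific scores: reweight shared query columns by m.
        attn = softmax((Q * m) @ K.T / scale, axis=-1)
        head_outputs.append(attn @ (X @ Wv))
    return np.concatenate(head_outputs, axis=1) @ Wo

def key_query_params(d_in, n_heads, d_k, d_shared=None):
    """Key/query weight count, ignoring biases.

    Standard heads: 2 * d_in * n_heads * d_k.
    Shared form: 2 * d_in * d_shared plus n_heads * d_shared mixing weights.
    """
    if d_shared is None:
        return 2 * d_in * n_heads * d_k
    return 2 * d_in * d_shared + n_heads * d_shared
```

Calling `key_query_params(768, 12, 64)` and `key_query_params(768, 12, 64, d_shared=256)` compares the two counts for one illustrative configuration. These sizes are made up for the example and are not results reported in the paper.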

Taxonomy

Topics: Advanced Neural Network Applications · Domain Adaptation and Few-Shot Learning · Multimodal Machine Learning Applications

Methods: Linear Layer · Absolute Position Encodings · Position-Wise Feed-Forward Layer · Residual Connection · Label Smoothing · Multi-Head Attention · Adam · Dropout · Byte Pair Encoding