Improving Transformers with Dynamically Composable Multi-Head Attention

Da Xiao; Qingye Meng; Shengping Li; Xingyuan Yuan

arXiv:2405.08553 · cs.LG · June 5, 2024 · 1 cites

TL;DR

This paper introduces DCMHA, a novel attention mechanism that dynamically composes attention heads, enhancing Transformer expressiveness and efficiency, leading to significant performance improvements in language modeling tasks.

Contribution

We propose DCMHA, a dynamic attention head composition method that replaces standard MHA, improving model capacity and efficiency across various Transformer architectures.

Findings

01 DCMHA outperforms standard MHA in language modeling tasks.

02 DCFormer matches the performance of Transformer models that use roughly 1.7x to 2.0x the compute.

03 Code and models are publicly available for replication.

Abstract

Multi-Head Attention (MHA) is a key component of the Transformer. In MHA, attention heads work independently, which causes problems such as the low-rank bottleneck of attention score matrices and head redundancy. We propose Dynamically Composable Multi-Head Attention (DCMHA), a parameter- and computation-efficient attention architecture that tackles these shortcomings of MHA and increases the expressive power of the model by dynamically composing attention heads. At the core of DCMHA is a $Compose$ function that transforms the attention score and weight matrices in an input-dependent way. DCMHA can be used as a drop-in replacement for MHA in any Transformer architecture to obtain the corresponding DCFormer. DCFormer significantly outperforms Transformer on different architectures and model scales in language modeling, matching the performance of models with ~1.7x-2.0x compute. For example,…
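To make the idea of composing heads concrete, here is a minimal NumPy sketch of an input-dependent head-mixing step applied to the attention scores (before softmax) and the attention weights (after softmax). It is an illustration under stated assumptions, not the authors' implementation: the function names `compose`, `dc_attention`, and `zero_params`, the parameter shapes, the particular combination of a static mixing matrix with a low-rank query-conditioned term, and the causal mask are all hypothetical choices made for this example.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax; -inf entries become exactly 0.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def compose(a, x, w_static, w_down, w_up):
    """Mix attention heads with an input-dependent transform (illustrative).

    a:        (H, T, S) per-head attention scores or weights
    x:        (T, D) query-side token representations
    w_static: (H, H) input-independent head mixing (assumed shape)
    w_down:   (D, H, R) generates a per-token H -> R projection (assumed)
    w_up:     (D, R, H) generates a per-token R -> H projection (assumed)
    """
    # Static branch: the same head-mixing matrix at every position.
    static = np.einsum("hts,hg->gts", a, w_static)
    # Dynamic branch: each query position t gets its own low-rank
    # head-mixing matrix, computed from that token's representation.
    down = np.einsum("td,dhr->thr", x, w_down)    # (T, H, R)
    up = np.einsum("td,drg->trg", x, w_up)        # (T, R, H)
    low = np.einsum("hts,thr->rts", a, down)      # (R, T, S)
    dynamic = np.einsum("rts,trg->gts", low, up)  # (H, T, S)
    # Residual form: with all-zero parameters this is the identity.
    return a + static + dynamic


def dc_attention(q, k, v, x, pre, post):
    """Causal attention with head composition before and after softmax.

    q, k, v: (H, T, d) per-head projections; x: (T, D) token inputs.
    pre, post: (w_static, w_down, w_up) tuples for the two compose steps.
    """
    h, t, d = q.shape
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d)
    scores = compose(scores, x, *pre)
    # Causal mask (assumed here because the task is language modeling).
    causal = np.tril(np.ones((t, t), dtype=bool))
    scores = np.where(causal, scores, -np.inf)
    weights = softmax(scores)
    # Masked entries are 0 in every head, so mixing heads keeps them 0.
    weights = compose(weights, x, *post)
    return weights @ v


def zero_params(h, d_model, r):
    # All-zero parameters reduce dc_attention to standard causal MHA.
    return (np.zeros((h, h)),
            np.zeros((d_model, h, r)),
            np.zeros((d_model, r, h)))
```

In this sketch the mixing acts only across the head axis, independently for each query-key pair, which is why it can be bolted onto an existing attention layer without changing the shapes of `q`, `k`, or `v`. With the parameters from `zero_params`, `dc_attention` computes ordinary causal multi-head attention, which is one way to read the abstract's "drop-in replacement" claim.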

Taxonomy

Topics: Neural Networks and Applications · EEG and Brain-Computer Interfaces · Anomaly Detection Techniques and Applications

Methods: Dense Connections · Position-Wise Feed-Forward Layer · Dropout · Label Smoothing · Residual Connection · Absolute Position Encodings · Byte Pair Encoding · Adam · Softmax · Attention Is All You Need