Improving Transformers with Dynamically Composable Multi-Head Attention
Da Xiao, Qingye Meng, Shengping Li, Xingyuan Yuan

TL;DR
This paper introduces Dynamically Composable Multi-Head Attention (DCMHA), an attention mechanism that composes attention heads in an input-dependent way. It increases Transformer expressiveness at low parameter and compute cost and improves language modeling performance.
Contribution
We propose DCMHA, a dynamic attention-head composition method that can serve as a drop-in replacement for standard MHA, improving model capacity and efficiency across Transformer architectures.
Findings
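DCFormer (a Transformer with DCMHA in place of MHA) outperforms the standard Transformer in language modeling across architectures and model scales.
DCFormer matches the performance of Transformer models that use ~1.7x-2.0x the compute.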
Code and models are publicly available for replication.
Abstract
Multi-Head Attention (MHA) is a key component of the Transformer. In MHA, attention heads work independently, causing problems such as the low-rank bottleneck of attention score matrices and head redundancy. We propose Dynamically Composable Multi-Head Attention (DCMHA), a parameter- and computation-efficient attention architecture that tackles the shortcomings of MHA and increases the expressive power of the model by dynamically composing attention heads. At the core of DCMHA is a function that transforms the attention score and weight matrices in an input-dependent way. DCMHA can be used as a drop-in replacement for MHA in any Transformer architecture to obtain the corresponding DCFormer. DCFormer significantly outperforms Transformer on different architectures and model scales in language modeling, matching the performance of models with ~1.7x-2.0x compute. For example,…
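The core idea in the abstract, transforming the attention score (pre-softmax) and weight (post-softmax) matrices in an input-dependent way, can be illustrated with a minimal NumPy sketch. This is not the authors' implementation. The function names (compose_heads, dcmha_sketch), the low-rank parameterization (w_u, w_v with rank R), the identity-plus-correction form of the mixing matrix, and the causal mask are all assumptions chosen for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax; masked entries set to -inf become 0.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def compose_heads(mats, x, w_base, w_u, w_v):
    """Mix per-head attention matrices across the head axis (illustrative).

    mats:     (H, T, T) attention scores or weights, one matrix per head
    x:        (T, D) token representations that drive the dynamic part
    w_base:   (H, H) static, input-independent head mixing
    w_u, w_v: (D, H, R) hypothetical projections giving low-rank factors
    """
    H = mats.shape[0]
    u = np.einsum("td,dhr->thr", x, w_u)        # (T, H, R)
    v = np.einsum("td,dhr->thr", x, w_v)        # (T, H, R)
    dynamic = np.einsum("tgr,thr->tgh", u, v)   # (T, H, H), rank <= R
    # One H x H mixing matrix per query position: identity + static + dynamic.
    mix = np.eye(H) + w_base + dynamic
    # Output head g at query t is a weighted sum over input heads h.
    return np.einsum("tgh,hts->gts", mix, mats)

def dcmha_sketch(x, wq, wk, wv, pre_params, post_params):
    """Attention with head composition before and after softmax (illustrative).

    x: (T, D); wq, wk, wv: (H, D, d). Returns per-head outputs of shape (H, T, d).
    """
    H, D, d = wq.shape
    T = x.shape[0]
    q = np.einsum("td,hde->hte", x, wq)
    k = np.einsum("td,hde->hte", x, wk)
    v = np.einsum("td,hde->hte", x, wv)

    scores = np.einsum("hte,hse->hts", q, k) / np.sqrt(d)
    scores = compose_heads(scores, x, *pre_params)     # compose attention scores

    # Assumed causal mask, as is usual for language modeling.
    causal = np.tril(np.ones((T, T), dtype=bool))
    scores = np.where(causal, scores, -np.inf)

    weights = softmax(scores, axis=-1)
    weights = compose_heads(weights, x, *post_params)  # compose attention weights
    return np.einsum("hts,hse->hte", weights, v)
```

With w_base, w_u, and w_v all set to zero, the mixing matrix is the identity and the sketch reduces to standard multi-head attention, which is one way to see how such a module could act as a drop-in replacement. The low-rank factors keep the dynamic part cheap: each query position needs only 2·H·R generated values rather than a full H x H matrix.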
Taxonomy
Topics: Neural Networks and Applications · EEG and Brain-Computer Interfaces · Anomaly Detection Techniques and Applications
Methods: Dense Connections · Position-Wise Feed-Forward Layer · Dropout · Label Smoothing · Residual Connection · Absolute Position Encodings · Byte Pair Encoding · Adam · Softmax · Attention Is All You Need
