MoH: Multi-Head Attention as Mixture-of-Head Attention
Peng Jin, Bo Zhu, Li Yuan, Shuicheng Yan

TL;DR
This paper introduces Mixture-of-Head attention (MoH), an attention mechanism that lets each token select the attention heads relevant to it. MoH improves inference efficiency and outperforms standard multi-head attention while using fewer heads.
Contribution
The paper proposes MoH, a novel attention architecture that treats heads as experts, enabling selective head usage and weighted summation, leading to efficiency gains and improved accuracy.
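The idea of treating heads as experts can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names, the router matrix `Wr`, and the softmax-over-selected-heads gating are assumptions made for illustration. The sketch first checks that concatenating heads and projecting equals summing per-head projections, then replaces that plain sum with a per-token weighted sum over the top-k heads.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax; entries equal to -inf become exactly 0.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def head_outputs(x, Wq, Wk, Wv):
    # x: (T, d_model); Wq, Wk, Wv: (h, d_model, d_head).
    # Returns the per-head attention outputs, shape (h, T, d_head).
    q = np.einsum("td,hde->hte", x, Wq)
    k = np.einsum("td,hde->hte", x, Wk)
    v = np.einsum("td,hde->hte", x, Wv)
    scores = np.einsum("hte,hse->hts", q, k) / np.sqrt(Wq.shape[-1])
    return np.einsum("hts,hse->hte", softmax(scores), v)

def mha_concat(heads, Wo):
    # Standard form: concatenate the heads, then apply the output projection.
    h, T, d_head = heads.shape
    concat = heads.transpose(1, 0, 2).reshape(T, h * d_head)
    return concat @ Wo.reshape(h * d_head, -1)

def mha_sum(heads, Wo):
    # Summation form: sum over heads of (head output @ that head's slice of Wo).
    return np.einsum("hte,hed->td", heads, Wo)

def topk_gates(x, Wr, top_k):
    # Hypothetical router: a linear score per head, keep the top_k heads
    # for each token, and normalize the kept scores with a softmax.
    logits = x @ Wr                                   # (T, h)
    idx = np.argsort(logits, axis=-1)[:, -top_k:]     # indices of kept heads
    mask = np.zeros(logits.shape, dtype=bool)
    np.put_along_axis(mask, idx, True, axis=-1)
    return softmax(np.where(mask, logits, -np.inf))   # zero for dropped heads

def moh_sum(heads, Wo, gates):
    # Weighted summation: each token weights each head by its gate value.
    return np.einsum("th,hte,hed->td", gates, heads, Wo)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    T, d_model, h, d_head = 5, 16, 4, 4
    x = rng.normal(size=(T, d_model))
    Wq, Wk, Wv = (rng.normal(size=(h, d_model, d_head)) for _ in range(3))
    Wo = rng.normal(size=(h, d_head, d_model))
    Wr = rng.normal(size=(d_model, h))

    heads = head_outputs(x, Wq, Wk, Wv)
    print(np.allclose(mha_concat(heads, Wo), mha_sum(heads, Wo)))
    gates = topk_gates(x, Wr, top_k=2)
    print(moh_sum(heads, Wo, gates).shape)
```

This dense version computes every head and then zeroes out the unselected ones, so it shows the arithmetic only. An efficiency gain would require skipping the computation for heads a token does not select.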
Findings
MoH outperforms standard multi-head attention while using fewer attention heads.
Pre-trained models like LLaMA3-8B can be fine-tuned into MoH models.
MoH achieves higher accuracy than the multi-head attention baseline across multiple benchmarks.
Abstract
In this work, we upgrade the multi-head attention mechanism, the core of the Transformer model, to improve efficiency while maintaining or surpassing the previous accuracy level. We show that multi-head attention can be expressed in the summation form. Drawing on the insight that not all attention heads hold equal significance, we propose Mixture-of-Head attention (MoH), a new architecture that treats attention heads as experts in the Mixture-of-Experts (MoE) mechanism. MoH has two significant advantages: First, MoH enables each token to select the appropriate attention heads, enhancing inference efficiency without compromising accuracy or increasing the number of parameters. Second, MoH replaces the standard summation in multi-head attention with a weighted summation, introducing flexibility to the attention mechanism and unlocking extra performance potential. Extensive experiments on…
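The summation form mentioned in the abstract can be written out as follows. The notation is assumed for illustration: $H^i$ is the output of head $i$, $W_O^i$ is the block of rows of the output projection $W_O$ that multiplies head $i$, and $g_i$ is a per-token routing weight that is zero for heads the token does not select.

```latex
% Standard multi-head attention: concatenation followed by projection
% equals a sum of per-head projections.
\mathrm{MultiHead}(X)
  = \mathrm{Concat}\left(H^1, \dots, H^h\right) W_O
  = \sum_{i=1}^{h} H^i W_O^i

% Mixture-of-Head attention: the plain sum becomes a weighted sum,
% with g_i = 0 for heads that are not selected for the token.
\mathrm{MoH}(X) = \sum_{i=1}^{h} g_i \, H^i W_O^i
```

Setting every $g_i = 1$ recovers standard multi-head attention, which is why the weighted form can be read as a generalization of it.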
Taxonomy
Topics: Multimodal Machine Learning Applications · Topic Modeling · Advanced Graph Neural Networks
Methods: Dense Connections · Residual Connection · Dropout · Layer Normalization · Adam · Byte Pair Encoding · Absolute Position Encodings · Softmax · Attention Is All You Need · Linear Layer
