MoH: Multi-Head Attention as Mixture-of-Head Attention

Peng Jin; Bo Zhu; Li Yuan; Shuicheng Yan

arXiv:2410.11842 · cs.CV · December 2, 2025 · 3 cites

Open Access · 3 Repos · 6 Models

TL;DR

This paper introduces Mixture-of-Head attention (MoH), an attention mechanism in which each token selects the heads relevant to it. MoH improves inference efficiency and outperforms standard multi-head attention while using fewer heads.

Contribution

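The paper proposes MoH, an attention architecture that treats attention heads as experts in a Mixture-of-Experts (MoE) mechanism. Each token selects a subset of heads, and the head outputs are combined by a weighted summation instead of the standard summation, which yields efficiency gains and improved accuracy.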

Findings

01 MoH outperforms standard multi-head attention while using fewer heads.

02 Pre-trained models such as LLaMA3-8B can be further tuned into MoH models.

03 MoH matches or exceeds the accuracy of standard multi-head attention across multiple benchmarks.

Abstract

In this work, we upgrade the multi-head attention mechanism, the core of the Transformer model, to improve efficiency while maintaining or surpassing the previous accuracy level. We show that multi-head attention can be expressed in the summation form. Drawing on the insight that not all attention heads hold equal significance, we propose Mixture-of-Head attention (MoH), a new architecture that treats attention heads as experts in the Mixture-of-Experts (MoE) mechanism. MoH has two significant advantages: First, MoH enables each token to select the appropriate attention heads, enhancing inference efficiency without compromising accuracy or increasing the number of parameters. Second, MoH replaces the standard summation in multi-head attention with a weighted summation, introducing flexibility to the attention mechanism and unlocking extra performance potential. Extensive experiments on…
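The two ideas in the abstract, the summation form of multi-head attention and its replacement by a per-token weighted summation, can be illustrated with a small NumPy sketch. This is not the authors' implementation. The router here (a single linear map `Wr`, top-k selection, softmax over the selected scores) is a hypothetical stand-in chosen for brevity, and all names, shapes, and sizes are assumptions made for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def head_outputs(x, Wq, Wk, Wv):
    """Per-head attention outputs, shape (heads, tokens, d_head)."""
    q = np.einsum("td,hdk->htk", x, Wq)
    k = np.einsum("td,hdk->htk", x, Wk)
    v = np.einsum("td,hdk->htk", x, Wv)
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(q.shape[-1])
    return softmax(scores) @ v

def mha_concat(heads, Wo):
    """Standard form: concatenate the heads, then apply one output projection."""
    h, t, dk = heads.shape
    concat = heads.transpose(1, 0, 2).reshape(t, h * dk)
    return concat @ Wo.reshape(h * dk, -1)

def mha_sum(heads, Wo):
    """Summation form: project each head with its own slice of Wo, then sum."""
    return np.einsum("htk,hkd->td", heads, Wo)

def moh_weighted_sum(x, heads, Wo, Wr, top_k):
    """Weighted summation with a hypothetical top-k router (an assumption)."""
    logits = x @ Wr                                   # (tokens, heads)
    idx = np.argsort(logits, axis=-1)[:, -top_k:]     # top_k heads per token
    mask = np.zeros_like(logits, dtype=bool)
    np.put_along_axis(mask, idx, True, axis=-1)
    gates = softmax(np.where(mask, logits, -np.inf))  # unselected heads get weight 0
    per_head = np.einsum("htk,hkd->htd", heads, Wo)   # each head's projected output
    return np.einsum("th,htd->td", gates, per_head), gates

# Illustrative sizes only; none of these values come from the paper.
rng = np.random.default_rng(0)
tokens, d_model, n_heads, d_head = 5, 16, 4, 4
x = rng.normal(size=(tokens, d_model))
Wq, Wk, Wv = (rng.normal(size=(n_heads, d_model, d_head)) for _ in range(3))
Wo = rng.normal(size=(n_heads, d_head, d_model))
Wr = rng.normal(size=(d_model, n_heads))

heads = head_outputs(x, Wq, Wk, Wv)
same = np.allclose(mha_concat(heads, Wo), mha_sum(heads, Wo))
out, gates = moh_weighted_sum(x, heads, Wo, Wr, top_k=2)
```

Here `same` checks that the concatenation and summation forms agree, and `gates` holds each token's head weights, with zeros for unselected heads. For clarity the sketch still computes every head for every token. An efficiency gain would only appear in an implementation that skips the unselected heads.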

Taxonomy

Topics: Multimodal Machine Learning Applications · Topic Modeling · Advanced Graph Neural Networks

Methods: Dense Connections · Residual Connection · Dropout · Layer Normalization · Adam · Byte Pair Encoding · Absolute Position Encodings · Softmax · Attention Is All You Need · Linear Layer