Multi-Head Mixture-of-Experts
Xun Wu, Shaohan Huang, Wenhui Wang, Furu Wei

TL;DR
The paper introduces Multi-Head Mixture-of-Experts (MH-MoE), a novel model that enhances expert activation and semantic analysis by splitting tokens into sub-tokens processed in parallel, improving context understanding across multiple tasks.
Contribution
It proposes a multi-head mechanism for SMoE that increases expert activation and enables fine-grained semantic analysis within tokens, with easy integration into existing models.
Findings
Improves expert activation in SMoE models.
Enhances context understanding and semantic analysis.
Demonstrates effectiveness across multiple language and multimodal tasks.
Abstract
Sparse Mixtures of Experts (SMoE) scales model capacity without significant increases in training and inference costs, but exhibits the following two issues: (1) Low expert activation, where only a small subset of experts is activated for optimization. (2) A lack of fine-grained analytical capabilities for multiple semantic concepts within individual tokens. We propose Multi-Head Mixture-of-Experts (MH-MoE), which employs a multi-head mechanism to split each token into multiple sub-tokens. These sub-tokens are then assigned to and processed by a diverse set of experts in parallel, and seamlessly reintegrated into the original token form. The multi-head mechanism enables the model to collectively attend to information from various representation spaces within different experts, while significantly enhancing expert activation, thereby deepening context understanding and alleviating overfitting.…
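The split, route, and merge steps described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, top-1 routing, and plain linear experts are all assumptions chosen to keep the example short, and any learned projection or gating details of the actual model are omitted.

```python
import random

def split_token(token, num_heads):
    """Split one token vector into num_heads equal-width sub-tokens."""
    d_head = len(token) // num_heads
    return [token[i * d_head:(i + 1) * d_head] for i in range(num_heads)]

def route_top1(sub_token, router):
    """Pick the expert with the highest dot-product score (assumed top-1 routing)."""
    scores = [sum(w * v for w, v in zip(row, sub_token)) for row in router]
    return max(range(len(scores)), key=scores.__getitem__)

def apply_expert(sub_token, expert):
    """Apply a hypothetical linear expert: a d_head x d_head matrix."""
    d = len(sub_token)
    return [sum(sub_token[i] * expert[i][j] for i in range(d)) for j in range(d)]

def mh_moe_layer(tokens, num_heads, router, experts):
    """Split each token, route every sub-token to one expert, then merge."""
    outputs, assignments = [], []
    for token in tokens:
        if len(token) % num_heads != 0:
            raise ValueError("token width must be divisible by num_heads")
        merged = []
        for sub in split_token(token, num_heads):
            expert_id = route_top1(sub, router)
            assignments.append(expert_id)
            # Concatenating expert outputs restores the original token width.
            merged.extend(apply_expert(sub, experts[expert_id]))
        outputs.append(merged)
    return outputs, assignments

# Illustrative sizes only; none of these values come from the paper.
random.seed(0)
d_model, num_heads, num_experts = 8, 4, 4
d_head = d_model // num_heads
router = [[random.gauss(0, 1) for _ in range(d_head)] for _ in range(num_experts)]
experts = [[[random.gauss(0, 1) for _ in range(d_head)] for _ in range(d_head)]
           for _ in range(num_experts)]
tokens = [[random.gauss(0, 1) for _ in range(d_model)] for _ in range(3)]

outputs, assignments = mh_moe_layer(tokens, num_heads, router, experts)
print(len(outputs), len(outputs[0]), len(assignments))  # 3 8 12
```

In this sketch, 3 tokens produce 12 routing decisions instead of 3, which gives an intuition for how splitting tokens into sub-tokens could spread work across more experts.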
Taxonomy
Topics: Speech and dialogue systems · Expert finding and Q&A systems
Methods: Sparse Evolutionary Training
