Multi-Head Mixture-of-Experts

Xun Wu; Shaohan Huang; Wenhui Wang; Furu Wei

arXiv:2404.15045·cs.CL·April 24, 2024·5 cites

Open Access · 1 Repo

TL;DR

The paper introduces Multi-Head Mixture-of-Experts (MH-MoE), a novel model that enhances expert activation and semantic analysis by splitting tokens into sub-tokens processed in parallel, improving context understanding across multiple tasks.

Contribution

It proposes a multi-head mechanism for SMoE that increases expert activation and enables fine-grained semantic analysis within tokens, with easy integration into existing models.

Findings

01. Improves expert activation in SMoE models.

02. Enhances context understanding and semantic analysis.

03. Demonstrates effectiveness across multiple language and multimodal tasks.

Abstract

Sparse Mixtures of Experts (SMoE) scale model capacity without significant increases in training and inference costs, but exhibit two issues: (1) low expert activation, where only a small subset of experts is activated for optimization, and (2) a lack of fine-grained analytical capabilities for multiple semantic concepts within individual tokens. We propose Multi-Head Mixture-of-Experts (MH-MoE), which employs a multi-head mechanism to split each token into multiple sub-tokens. These sub-tokens are then assigned to and processed by a diverse set of experts in parallel, and seamlessly reintegrated into the original token form. The multi-head mechanism enables the model to collectively attend to information from various representation spaces within different experts, while significantly enhancing expert activation, thus deepening context understanding and alleviating overfitting.…
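The split, route, and merge flow described in the abstract can be illustrated with a small standard-library Python sketch. It is a toy under stated assumptions, not the authors' implementation: the names `split_heads`, `top1_expert`, and `mh_moe_layer` are hypothetical, routing is plain top-1 by dot product, the experts are simple scaling functions, and any learned projections, gate weighting, or load balancing a real model would use are omitted.

```python
def split_heads(token, num_heads):
    """Split one token vector into num_heads equal-length sub-tokens."""
    dim = len(token)
    if dim % num_heads != 0:
        raise ValueError("token dimension must be divisible by num_heads")
    step = dim // num_heads
    return [token[i:i + step] for i in range(0, dim, step)]


def top1_expert(sub_token, router):
    """Pick the expert whose router row has the highest dot product (assumed top-1 routing)."""
    scores = [sum(w * x for w, x in zip(row, sub_token)) for row in router]
    return max(range(len(scores)), key=scores.__getitem__)


def mh_moe_layer(tokens, num_heads, router, experts):
    """Route every sub-token to one expert, then concatenate results back into token form."""
    outputs, used = [], set()
    for token in tokens:
        merged = []
        for sub in split_heads(token, num_heads):
            idx = top1_expert(sub, router)
            used.add(idx)                      # track which experts were activated
            merged.extend(experts[idx](sub))   # expert output keeps the sub-token length
        outputs.append(merged)
    return outputs, used


# Hypothetical toy setup: 4 experts, expert k scales its input by k + 1.
experts = [lambda v, s=k + 1: [s * x for x in v] for k in range(4)]
# One router row per expert; each row matches the sub-token dimension (2).
router = [[1, 0], [0, 1], [-1, 0], [0, -1]]
tokens = [[1.0, 0.0, 0.0, 1.0]]

out, used = mh_moe_layer(tokens, num_heads=2, router=router, experts=experts)
print(out, sorted(used))
```

In this toy run the single 4-dimensional token is split into two 2-dimensional sub-tokens that land on different experts, so one token activates two experts while the merged output keeps the original token length.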

Code & Models

Repositories

yushuiwx/mh-moe
PyTorch · Official

Taxonomy

Topics: Speech and dialogue systems · Expert finding and Q&A systems

Methods: Sparse Evolutionary Training