MaskMoE: Boosting Token-Level Learning via Routing Mask in Mixture-of-Experts

Zhenpeng Su; Zijia Lin; Xue Bai; Xing Wu; Yizhe Xiong; Haoran Lian; Guangyuan Ma; Hui Chen; Guiguang Ding; Wei Zhou; Songlin Hu

arXiv:2407.09816·cs.CL·August 30, 2024

TL;DR

MaskMoE applies a routing mask in Mixture-of-Experts models to strengthen token-level learning while preserving representation diversity, improving language-modeling perplexity and downstream-task performance.

Contribution

The paper proposes MaskMoE, a novel routing mask method that enhances token-level learning and representation diversity in MoE models.

Findings

01. Outperforms previous MoE models in perplexity and on downstream tasks
02. Maintains representation diversity while improving training comprehensiveness
03. Demonstrates effectiveness through extensive experiments

Abstract

Scaling the size of a model enhances its capabilities but significantly increases computational complexity. Mixture-of-Experts (MoE) models address this issue by allowing model size to scale up without substantially increasing training or inference costs. An important module in MoE is the router, which distributes each token to the experts. Currently, the mainstream routing methods are dynamic routing and fixed routing. Despite their promising results, MoE models encounter several challenges. First, with dynamic routing, the dispersion of training tokens across multiple experts can lead to underfitting, particularly for infrequent tokens. Second, although fixed routing can mitigate that issue, it compromises the diversity of representations. In this paper, we propose MaskMoE, a method designed to enhance token-level learning…
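
To make the routing terminology in the abstract concrete, here is a minimal sketch in plain Python of the two mainstream schemes it names, plus a generic masked variant. Everything in it is an assumption for illustration: the function names, the modulo-based fixed assignment, and the hand-written boolean mask are hypothetical, and the abstract is cut off before it describes how MaskMoE actually constructs or applies its routing mask.

```python
import math

NUM_EXPERTS = 4  # hypothetical expert count, chosen only for this example


def softmax(logits):
    """Numerically stable softmax over a list of router scores."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def dynamic_route(logits):
    """Dynamic routing: send the token to the expert with the highest score (top-1)."""
    probs = softmax(logits)
    return max(range(len(probs)), key=probs.__getitem__)


def fixed_route(token_id, num_experts=NUM_EXPERTS):
    """Fixed routing: a given token id always maps to the same expert.

    The modulo rule is an assumed stand-in for any static assignment.
    """
    return token_id % num_experts


def masked_route(logits, mask):
    """Top-1 routing restricted to the experts where mask[i] is True.

    Illustrative only; not taken from the paper.
    """
    if not any(mask):
        raise ValueError("mask must leave at least one expert visible")
    # Hidden experts get a score of -inf, so their softmax probability is 0.
    masked = [x if keep else -math.inf for x, keep in zip(logits, mask)]
    return dynamic_route(masked)


if __name__ == "__main__":
    scores = [0.1, 2.0, -1.0, 0.5]  # made-up router scores for one token
    print(dynamic_route(scores))                             # expert 1
    print(fixed_route(7))                                    # expert 3
    print(masked_route(scores, [True, False, False, True]))  # expert 3
```

With a mask that leaves a single expert visible, `masked_route` behaves like fixed routing; with all experts visible, it reduces to `dynamic_route`. That is one way to see how a mask can sit between the two schemes the abstract contrasts.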

Code & Models

Repositories

suu990901/MaskMoE (Official)

Taxonomy

Topics: Domain Adaptation and Few-Shot Learning · Machine Learning and Algorithms · Machine Learning and Data Classification

Methods: Mixture of Experts