MaskMoE: Boosting Token-Level Learning via Routing Mask in Mixture-of-Experts
Zhenpeng Su, Zijia Lin, Xue Bai, Xing Wu, Yizhe Xiong, Haoran Lian, Guangyuan Ma, Hui Chen, Guiguang Ding, Wei Zhou, Songlin Hu

TL;DR
MaskMoE introduces a routing mask for Mixture-of-Experts models that strengthens token-level learning while preserving representation diversity, leading to better performance in language modeling and downstream tasks.
Contribution
The paper proposes MaskMoE, a novel routing mask method that enhances token-level learning and representation diversity in MoE models.
Findings
Outperforms previous MoE models in perplexity and downstream tasks
Maintains representation diversity while improving training comprehensiveness
Demonstrates effectiveness through extensive experiments
Abstract
Scaling the size of a model enhances its capabilities but significantly increases computational complexity. Mixture-of-Experts (MoE) models address the issue by allowing model size to scale up without substantially increasing training or inference costs. In MoE, there is an important module called the router, which is used to distribute each token to the experts. Currently, the mainstream routing methods include dynamic routing and fixed routing. Despite their promising results, MoE models encounter several challenges. Primarily, for dynamic routing methods, the dispersion of training tokens across multiple experts can lead to underfitting, particularly for infrequent tokens. Additionally, though fixed routing methods can mitigate that issue, they compromise on the diversity of representations. In this paper, we propose MaskMoE, a method designed to enhance token-level learning…
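The abstract is cut off before it describes the method, so the following is only a minimal sketch of how a routing mask could sit between the two routing styles it contrasts, not the authors' implementation. It assumes the mask hides some experts from a token before top-1 selection: with every expert visible the router behaves dynamically, and with a single visible expert the routing is effectively fixed. The names `build_mask` and `masked_top1_route`, and the rule for choosing visible experts from the token ID, are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax; masked (-inf) entries get probability 0."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def build_mask(token_id, num_experts, num_visible):
    """Hypothetical mask rule: a token sees `num_visible` experts,
    chosen deterministically from its vocabulary ID."""
    assert 1 <= num_visible <= num_experts
    visible = {(token_id + k) % num_experts for k in range(num_visible)}
    return [e in visible for e in range(num_experts)]


def masked_top1_route(logits, mask):
    """Pick the highest-scoring expert among those the mask leaves visible.

    logits: router scores, one per expert (computed from the hidden state).
    mask:   booleans, one per expert; False hides that expert.
    """
    assert any(mask), "at least one expert must stay visible"
    masked = [s if keep else -math.inf for s, keep in zip(logits, mask)]
    probs = softmax(masked)
    expert = max(range(len(probs)), key=probs.__getitem__)
    return expert, probs


if __name__ == "__main__":
    logits = [0.1, 2.0, 0.5, 1.0]  # made-up router scores for 4 experts

    # All experts visible: ordinary dynamic top-1 routing.
    print(masked_top1_route(logits, build_mask(7, 4, 4))[0])

    # One expert visible: the token always lands on the same expert.
    print(masked_top1_route(logits, build_mask(7, 4, 1))[0])
```

Under these assumptions, giving a token fewer visible experts concentrates its training signal on fewer experts, while giving it more lets the router spread it across different experts depending on context. How many experts each token should see is a design choice that the truncated abstract does not specify.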
Taxonomy
Topics: Domain Adaptation and Few-Shot Learning · Machine Learning and Algorithms · Machine Learning and Data Classification
Methods: Mixture of Experts
