MoEC: Mixture of Expert Clusters
Yuan Xie, Shaohan Huang, Tianyu Chen, Furu Wei

TL;DR
MoEC groups experts into clusters, applies variance-based constraints at the routing stage, and adds cluster-level expert dropout to improve the scalability and performance of Mixture of Experts models, especially on tasks with limited data.
Contribution
The paper proposes Mixture of Expert Clusters (MoEC), a novel method that enhances MoE models by encouraging diverse expert learning and reducing overfitting through clustering and dropout strategies.
Findings
MoEC improves performance on machine translation tasks.
MoEC raises the performance upper bound reachable by scaling up the number of experts when data is limited.
MoEC mitigates overfitting and sparse data allocation issues.
Abstract
Sparse Mixture of Experts (MoE) has received great interest due to its promising scaling capability with affordable computational overhead. MoE converts dense layers into sparse experts and uses a gated routing network to activate experts conditionally. However, as the number of experts grows, MoE with outrageous parameters suffers from overfitting and sparse data allocation. These problems are especially severe on tasks with limited data, which keeps MoE models from improving performance by scaling up. In this work, we propose Mixture of Expert Clusters - a general approach that enables expert layers to learn more diverse and appropriate knowledge by imposing variance-based constraints on the routing stage. We further propose a cluster-level expert dropout strategy specifically designed for the expert cluster structure. Our experiments reveal that MoEC could…
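The abstract names two mechanisms, gated routing and cluster-level expert dropout, without giving details here. The following is a minimal sketch of how they might fit together for a single token, not the authors' implementation. The function name `route_token`, the fixed `clusters` grouping, the rule "a token's cluster is the one holding its top-scoring expert", and the "keep at least one expert" fallback are all assumptions made for illustration. The variance-based routing constraints are not modeled.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def route_token(logits, clusters, drop_rate=0.0, rng=None):
    """Return (expert_index, gate_value) for one token.

    logits    -- one router score per expert
    clusters  -- list of lists of expert indices (assumed, fixed grouping)
    drop_rate -- chance of dropping each expert in the chosen cluster;
                 use 0.0 at inference time
    """
    rng = rng or random.Random()
    probs = softmax(logits)

    # Step 1 (assumption): the token's cluster is the one that holds
    # its highest-scoring expert.
    best = max(range(len(probs)), key=probs.__getitem__)
    cluster = next(c for c in clusters if best in c)

    # Step 2: drop experts inside that cluster only, so the token is
    # always re-routed to an expert from the same cluster.
    survivors = [e for e in cluster if rng.random() >= drop_rate]
    if not survivors:
        # Assumption: keep one expert so the token is never discarded.
        survivors = [rng.choice(cluster)]

    # Step 3: top-1 routing among the surviving experts.
    chosen = max(survivors, key=probs.__getitem__)
    return chosen, probs[chosen]


if __name__ == "__main__":
    clusters = [[0, 1], [2, 3]]       # two hypothetical clusters of two experts
    logits = [0.1, 0.2, 2.0, 1.5]     # made-up router scores for one token
    print(route_token(logits, clusters))
    print(route_token(logits, clusters, drop_rate=0.5, rng=random.Random(0)))
```

With `drop_rate=0.0` this reduces to ordinary top-1 routing. With a positive rate, a token whose preferred expert is dropped falls back to another expert from the same cluster, which is one plausible reading of how dropout can be "designed for the expert cluster structure".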
Taxonomy
Topics: Expert finding and Q&A systems · Domain Adaptation and Few-Shot Learning · Multimodal Machine Learning Applications
Methods: Dropout
