MoEC: Mixture of Expert Clusters

Yuan Xie; Shaohan Huang; Tianyu Chen; Furu Wei

arXiv:2207.09094 · cs.CL · July 20, 2022

TL;DR

MoEC organizes experts into clusters, applying variance-based routing constraints and cluster-level expert dropout, to improve the scalability and performance of Mixture of Experts models, especially on tasks with limited data.

Contribution

The paper proposes Mixture of Expert Clusters (MoEC), a novel method that enhances MoE models by encouraging diverse expert learning and reducing overfitting through clustering and dropout strategies.

Findings

01  MoEC improves performance on machine translation tasks.

02  MoEC raises the performance upper bound for scaled experts with limited data.

03  MoEC mitigates overfitting and sparse data allocation issues.

Abstract

Sparse Mixture of Experts (MoE) has received great interest due to its promising scaling capability with affordable computational overhead. MoE converts dense layers into sparse experts and uses a gated routing network to activate experts conditionally. However, as the number of experts grows, MoE models with outrageously large parameter counts suffer from overfitting and sparse data allocation. These problems are especially severe on tasks with limited data, which hinders MoE models from improving performance by scaling up. In this work, we propose Mixture of Expert Clusters (MoEC), a general approach that enables expert layers to learn more diverse and appropriate knowledge by imposing variance-based constraints on the routing stage. We further propose a cluster-level expert dropout strategy specifically designed for the expert cluster structure. Our experiments reveal that MoEC could…
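To make the gated routing described in the abstract concrete, here is a minimal sketch of generic top-1 MoE routing in plain Python. It is not the MoEC method: the variance-based constraints and cluster-level dropout are not reproduced here, because the abstract does not specify them. All names (`route_top1`, `moe_layer`, `make_expert`, `gate_weights`) and the toy scaling "experts" are assumptions made for illustration.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def route_top1(token, gate_weights):
    """Return (expert_index, gate_probability) for one token.

    gate_weights[e] is a hypothetical gating vector for expert e.
    """
    logits = [sum(w * x for w, x in zip(row, token)) for row in gate_weights]
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return best, probs[best]


def moe_layer(tokens, gate_weights, experts):
    """Run only the selected expert per token, scaled by its gate probability."""
    outputs = []
    load = [0] * len(experts)  # number of tokens each expert received
    for token in tokens:
        idx, prob = route_top1(token, gate_weights)
        load[idx] += 1
        outputs.append([prob * y for y in experts[idx](token)])
    return outputs, load


def make_expert(scale):
    # Toy stand-in for an expert feed-forward block: element-wise scaling.
    return lambda token: [scale * x for x in token]


rng = random.Random(0)
num_experts, dim = 4, 3
gate_weights = [[rng.uniform(-1, 1) for _ in range(dim)] for _ in range(num_experts)]
experts = [make_expert(s) for s in (1.0, 2.0, 3.0, 4.0)]
tokens = [[rng.uniform(-1, 1) for _ in range(dim)] for _ in range(8)]

outputs, load = moe_layer(tokens, gate_weights, experts)
print("tokens per expert:", load)
```

The `load` list shows how a fixed batch of tokens is split among experts. With the same amount of data, adding experts lowers the average number of tokens each one sees, which is one way to read the "sparse data allocation" problem the abstract mentions.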

Taxonomy

Topics: Expert finding and Q&A systems · Domain Adaptation and Few-Shot Learning · Multimodal Machine Learning Applications

Methods: Dropout