Fixing MoE Over-Fitting on Low-Resource Languages in Multilingual Machine Translation
Maha Elbayad, Anna Sun, Shruti Bhosale

TL;DR
This paper introduces regularization strategies for Mixture of Experts (MoE) models in multilingual machine translation that prevent over-fitting on low-resource languages, improving translation quality by about +1 chrF++ on very low-resource language pairs without harming high-resource performance.
Contribution
The paper proposes dropout techniques for MoE layers, conditional MoE routing, and curriculum learning methods to regularize MoE models on low-resource translation tasks.
Findings
Approximately +1 chrF++ improvement on very low-resource language pairs.
Regularization methods effectively prevent over-fitting in low-resource scenarios.
An extensive analysis of the learned MoE routing shows how the regularization methods affect the model and how they could be improved.
Abstract
Sparsely gated Mixture of Experts (MoE) models have been shown to be a compute-efficient method to scale model capacity for multilingual machine translation. However, for low-resource tasks, MoE models severely over-fit. We show effective regularization strategies, namely dropout techniques for MoE layers in EOM and FOM, Conditional MoE Routing and Curriculum Learning methods that prevent over-fitting and improve the performance of MoE models on low-resource tasks without adversely affecting high-resource tasks. On a massively multilingual machine translation benchmark, our strategies result in about +1 chrF++ improvement in very low resource language pairs. We perform an extensive analysis of the learned MoE routing to better understand the impact of our regularization methods and how we can improve them.
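The abstract names dropout techniques for MoE layers (EOM and FOM) without describing them. As a rough illustration of the general idea only, the sketch below shows one way that masking individual expert outputs during training could look in a toy top-2 MoE layer. The function name, the top-2 gating, the linear experts, and the `p_mask` parameter are all assumptions made for this example; it is not the authors' implementation and may differ from their method in important details.

```python
import numpy as np

def moe_layer_with_output_masking(x, gate_w, expert_ws, p_mask=0.2,
                                  training=True, rng=None):
    """Toy top-2 MoE layer that masks individual expert outputs in training.

    x: (tokens, d_model); gate_w: (d_model, n_experts);
    expert_ws: (n_experts, d_model, d_model), one linear map per expert.
    All names and shapes here are hypothetical, chosen for illustration.
    """
    rng = np.random.default_rng() if rng is None else rng
    n_tokens = x.shape[0]

    # Softmax gate, then keep the two highest-scoring experts per token.
    logits = x @ gate_w
    probs = np.exp(logits - logits.max(axis=1, keepdims=True))
    probs /= probs.sum(axis=1, keepdims=True)
    top2 = np.argsort(-probs, axis=1)[:, :2]          # (tokens, 2)

    out = np.zeros_like(x)
    for slot in range(2):
        idx = top2[:, slot]                           # chosen expert per token
        weight = probs[np.arange(n_tokens), idx]      # gate weight per token
        # Apply each token's chosen expert (a per-token linear map).
        expert_out = np.einsum("td,tde->te", x, expert_ws[idx])
        if training:
            # Assumption: drop this expert's contribution for a random
            # fraction p_mask of tokens, without rescaling the survivors.
            keep = rng.random(n_tokens) >= p_mask
            expert_out = expert_out * keep[:, None]
        out += weight[:, None] * expert_out
    return out
```

With `training=False` the layer is deterministic; with `training=True`, each token loses the contribution of its first and/or second expert with probability `p_mask`, which discourages the model from relying on any single expert for a given token.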
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Multimodal Machine Learning Applications
Methods: Expert Output Masking (EOM) · Dropout
