MomentumSMoE: Integrating Momentum into Sparse Mixture of Experts
Rachel S.Y. Teo, Tan M. Nguyen

TL;DR
This paper introduces MomentumSMoE, a new approach that incorporates momentum into Sparse Mixture of Experts to improve training stability and robustness, demonstrated on vision and language tasks.
Contribution
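We establish a theoretical connection between the dynamics of expert representations in SMoE and gradient descent on a multi-objective optimization problem. Building on that connection, we integrate momentum into SMoE to improve stability and robustness. The approach applies to a range of SMoE models.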
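The idea can be illustrated with a small sketch. It assumes that each SMoE layer acts as a residual update x <- x + SMoE(x), that the negated SMoE output plays the role of a gradient, and that momentum takes the classical heavy-ball form. The summary does not spell out the update rule, so this form is an assumption. The names `smoe_output`, `momentum_smoe_forward`, `mu`, and `gamma` are hypothetical and not taken from the paper's code.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def smoe_output(x, gate_weights, experts, top_k=2):
    """Sparse MoE output: route x to the top_k experts and mix their outputs."""
    # Router scores: one dot product per expert.
    scores = [sum(w * xi for w, xi in zip(row, x)) for row in gate_weights]
    # Keep only the highest-scoring experts; the rest are never evaluated.
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    chosen = order[:top_k]
    probs = softmax([scores[i] for i in chosen])
    out = [0.0] * len(x)
    for p, i in zip(probs, chosen):
        y = experts[i](x)
        out = [o + p * yi for o, yi in zip(out, y)]
    return out


def momentum_smoe_forward(x, layers, mu=0.7, gamma=1.0):
    """Stack of SMoE layers with a momentum state carried across depth.

    Assumed heavy-ball form (illustrative, not the authors' implementation):
        p <- -SMoE(x) + mu * p
        x <- x - gamma * p
    With mu=0 and gamma=1 this reduces to the plain residual update.
    """
    p = [0.0] * len(x)  # momentum state, initialised to zero
    for gate_weights, experts in layers:
        f = smoe_output(x, gate_weights, experts)
        # Treat -f as the gradient and accumulate it with momentum.
        p = [-fi + mu * pi for fi, pi in zip(f, p)]
        # Step of size gamma against the accumulated direction.
        x = [xi - gamma * pi for xi, pi in zip(x, p)]
    return x
```

In this sketch, `layers` is a list of `(gate_weights, experts)` pairs, where `gate_weights` holds one weight row per expert and each expert is a callable mapping a list of floats to a list of the same length. Setting `mu=0.0` and `gamma=1.0` recovers an ordinary stack of residual SMoE layers, which makes the momentum term easy to ablate.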
Findings
MomentumSMoE is more stable than traditional SMoE.
MomentumSMoE improves robustness against data contamination.
MomentumSMoE applies to both vision and language models at minimal additional cost.
Abstract
Sparse Mixture of Experts (SMoE) has become the key to unlocking unparalleled scalability in deep learning. SMoE has the potential to exponentially increase parameter count while maintaining the efficiency of the model by only activating a small subset of these parameters for a given sample. However, it has been observed that SMoE suffers from unstable training and has difficulty adapting to new distributions, leading to the model's lack of robustness to data contamination. To overcome these limitations, we first establish a connection between the dynamics of the expert representations in SMoEs and gradient descent on a multi-objective optimization problem. Leveraging our framework, we then integrate momentum into SMoE and propose a new family of SMoEs named MomentumSMoE. We theoretically prove and numerically demonstrate that MomentumSMoE is more stable and robust than SMoE. In…
Taxonomy
Topics: Anomaly Detection Techniques and Applications · Time Series Analysis and Forecasting · Gaussian Processes and Bayesian Inference
Methods: Adam · Mixture of Experts
