MomentumSMoE: Integrating Momentum into Sparse Mixture of Experts

Rachel S.Y. Teo; Tan M. Nguyen

arXiv:2410.14574·cs.LG·October 21, 2024

TL;DR

This paper introduces MomentumSMoE, a new approach that incorporates momentum into Sparse Mixture of Experts to improve training stability and robustness, demonstrated on vision and language tasks.

Contribution

We establish a theoretical connection between the dynamics of expert representations in SMoE and gradient descent on a multi-objective optimization problem, then integrate momentum into SMoE to improve stability and robustness. The approach applies to a range of SMoE models.

Findings

01 MomentumSMoE trains more stably than a standard SMoE.

02 MomentumSMoE is more robust to data contamination.

03 MomentumSMoE applies to both vision and language models at minimal additional cost.

Abstract

Sparse Mixture of Experts (SMoE) has become the key to unlocking unparalleled scalability in deep learning. SMoE has the potential to exponentially increase parameter count while maintaining the efficiency of the model by only activating a small subset of these parameters for a given sample. However, it has been observed that SMoE suffers from unstable training and has difficulty adapting to new distributions, leading to the model's lack of robustness to data contamination. To overcome these limitations, we first establish a connection between the dynamics of the expert representations in SMoEs and gradient descent on a multi-objective optimization problem. Leveraging our framework, we then integrate momentum into SMoE and propose a new family of SMoEs named MomentumSMoE. We theoretically prove and numerically demonstrate that MomentumSMoE is more stable and robust than SMoE. In…
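
The idea in the abstract can be illustrated with a small sketch. Assume each SMoE layer output f(x) acts like a descent step in a residual update x ← x + γ f(x); a heavy-ball variant then keeps a running buffer p ← μ p + f(x) and updates x ← x + γ p. The code below is an illustration under those assumptions, not the authors' implementation: the function names, the linear experts, the softmax top-k router, and the parameters gamma and mu are all hypothetical choices made for this example.

```python
import math
import random


def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def top_k_gate(scores, k):
    """Keep the k largest router scores and softmax over them only."""
    top = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    m = max(scores[i] for i in top)  # subtract the max for numerical stability
    exps = {i: math.exp(scores[i] - m) for i in top}
    z = sum(exps.values())
    return {i: e / z for i, e in exps.items()}


def smoe_layer(x, router, experts, k):
    """Sparse MoE output: gate-weighted sum of the top-k experts.

    Experts are plain linear maps here purely to keep the sketch short.
    """
    gates = top_k_gate(matvec(router, x), k)
    out = [0.0] * len(x)
    for i, g in gates.items():
        y = matvec(experts[i], x)
        out = [o + g * yi for o, yi in zip(out, y)]
    return out


def forward(x, layers, k=2, gamma=1.0, mu=0.0):
    """Residual stack of SMoE layers with an optional heavy-ball buffer.

    mu = 0 reduces to the plain residual update x <- x + gamma * f(x).
    mu > 0 carries a fraction of earlier layer outputs forward in p.
    """
    p = [0.0] * len(x)
    for router, experts in layers:
        f = smoe_layer(x, router, experts, k)
        p = [mu * pi + fi for pi, fi in zip(p, f)]
        x = [xi + gamma * pi for xi, pi in zip(x, p)]
    return x


def make_layers(n_layers, n_experts, dim, seed=0):
    """Random toy weights: one router and n_experts linear experts per layer."""
    rng = random.Random(seed)

    def rand(rows, cols):
        return [[rng.gauss(0, 0.1) for _ in range(cols)] for _ in range(rows)]

    return [
        (rand(n_experts, dim), [rand(dim, dim) for _ in range(n_experts)])
        for _ in range(n_layers)
    ]


if __name__ == "__main__":
    layers = make_layers(n_layers=3, n_experts=4, dim=5)
    x0 = [1.0, -0.5, 0.25, 2.0, 0.0]
    print("plain   :", forward(x0, layers, mu=0.0))
    print("momentum:", forward(x0, layers, mu=0.7))
```

Setting mu=0 recovers the plain residual SMoE stack, so the two variants can be compared on identical weights; the buffer p adds only one extra vector of state per token.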

Code & Models

Repositories

rachtsy/momentumsmoe (PyTorch, official)

Videos

MomentumSMoE: Integrating Momentum into Sparse Mixture of Experts · SlidesLive

Taxonomy

Topics: Anomaly Detection Techniques and Applications · Time Series Analysis and Forecasting · Gaussian Processes and Bayesian Inference

Methods: Adam · Mixture of Experts