Adaptive Gating in Mixture-of-Experts based Language Models
Jiamin Li, Qiang Su, Yitao Yang, Yimin Jiang, Cong Wang, Hong Xu

TL;DR
This paper proposes adaptive gating in mixture-of-experts models for NLP, allowing variable expert utilization per token, which reduces training time by up to 22.5% while maintaining performance.
Contribution
It introduces a novel adaptive gating mechanism for MoE models that dynamically adjusts expert usage per token, enhancing training efficiency and preserving sparsity.
Findings
Reduces training time by up to 22.5%.
Maintains inference quality comparable to fixed gating models.
Provides insights into routing decisions with adaptive gating.
Abstract
Large language models, such as OpenAI's ChatGPT, have demonstrated exceptional language understanding capabilities in various NLP tasks. Sparsely activated mixture-of-experts (MoE) has emerged as a promising solution for scaling models while maintaining a constant number of computational operations. Existing MoE models adopt a fixed gating network where each token is computed by the same number of experts. However, this approach contradicts our intuition that the tokens in each sequence vary in terms of their linguistic complexity and, consequently, require different computational costs. Prior research says little about the trade-off between computation per token and model performance. This paper introduces adaptive gating in MoE, a flexible training strategy that allows tokens to be processed by a variable number of experts based on the expert probability distribution. The proposed…
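To make the idea of a variable number of experts per token concrete, here is a minimal sketch in plain Python. It is not the authors' implementation. The abstract only says that routing depends on the expert probability distribution; the specific rule used below (send a token to one expert when the gap between the two highest gate probabilities exceeds a threshold, otherwise to two) is an assumption chosen for illustration, and the names `adaptive_route` and `threshold` are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of gate logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def adaptive_route(gate_logits, threshold=0.5):
    """Pick one or two experts for a single token.

    Assumed rule (illustrative only): if the top expert's probability
    exceeds the runner-up's by more than `threshold`, the gate is
    confident and one expert is enough; otherwise use the top two.
    Returns a list of (expert_index, combine_weight) pairs whose
    weights sum to 1.
    """
    if len(gate_logits) < 2:
        raise ValueError("need at least two experts")
    probs = softmax(gate_logits)
    # Expert indices sorted by descending gate probability.
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    top1, top2 = ranked[0], ranked[1]
    if probs[top1] - probs[top2] > threshold:
        chosen = [top1]
    else:
        chosen = [top1, top2]
    # Renormalize the selected probabilities so they can weight expert outputs.
    norm = sum(probs[i] for i in chosen)
    return [(i, probs[i] / norm) for i in chosen]


confident = adaptive_route([5.0, 0.0, 0.0, 0.0])   # one dominant expert
ambiguous = adaptive_route([1.0, 0.9, 0.0, 0.0])   # two close candidates
print(len(confident), len(ambiguous))
```

Under this assumed rule, a token whose gate distribution is sharply peaked costs one expert's worth of computation, while an ambiguous token costs two. Fewer tokens taking the two-expert path is the kind of saving that would shorten training time, though the threshold value here is arbitrary.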
Taxonomy
Topics: Topic Modeling · Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning
