GRIN: GRadient-INformed MoE
Liyuan Liu, Young Jin Kim, Shuohang Wang, Chen Liang, Yelong Shen, Hao Cheng, Xiaodong Liu, Masahiro Tanaka, Xiaoxia Wu, Wenxiang Hu, Vishrav Chaudhary, Zeqi Lin, Chenruidong Zhang, Jilong Xue, Hany Awadalla, Jianfeng Gao, Weizhu Chen

TL;DR
GRIN introduces a gradient-informed training method for MoE models, enabling efficient sparse routing and improved scaling, resulting in models that outperform or match larger dense models across various tasks.
Contribution
The paper presents GRIN, a novel training approach for MoE models that incorporates sparse gradient estimation and optimized parallelism, enhancing scalability and performance.
Findings
A top-2 MoE model with only 6.6B activated parameters outperforms a 7B dense model and matches a 14B dense model trained on the same data.
GRIN enables training of large MoE models without token dropping.
Extensive evaluations show significant performance improvements across diverse tasks.
Abstract
Mixture-of-Experts (MoE) models scale more effectively than dense models due to sparse computation through expert routing, selectively activating only a small subset of expert modules. However, sparse computation challenges traditional training practices, as discrete expert routing hinders standard backpropagation and thus gradient-based optimization, which are the cornerstone of deep learning. To better pursue the scaling power of MoE, we introduce GRIN (GRadient-INformed MoE training), which incorporates sparse gradient estimation for expert routing and configures model parallelism to avoid token dropping. Applying GRIN to autoregressive language modeling, we develop a top-2 16×3.8B MoE model. Our model, with only 6.6B activated parameters, outperforms a 7B dense model and matches the performance of a 14B dense model trained on the same data. Extensive evaluations across…
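The routing step the abstract describes can be illustrated with a small sketch. This is not the GRIN implementation: the toy scalar experts, the names `top2_route` and `moe_layer`, and the renormalization of the two selected gate weights are all assumptions made for illustration. It shows only the forward pass of top-2 routing over 16 experts with no capacity limit, so every token reaches both of its selected experts and none is dropped. The index selection inside `top2_route` is the discrete choice that plain backpropagation cannot differentiate through, which is the gap that sparse gradient estimation is meant to address.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def top2_route(router_logits):
    """Return [(expert_index, weight), (expert_index, weight)] for one token.

    Assumption: the two selected gate values are renormalized to sum to 1.
    The discrete index selection below is the non-differentiable step.
    """
    probs = softmax(router_logits)
    top2 = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:2]
    norm = sum(probs[i] for i in top2)
    return [(i, probs[i] / norm) for i in top2]


def moe_layer(token, router_logits, experts):
    """Weighted sum of the two selected experts; the others are never evaluated."""
    out = 0.0
    for idx, weight in top2_route(router_logits):
        out += weight * experts[idx](token)
    return out


# Hypothetical toy setup: 16 experts, each scaling a scalar "token".
NUM_EXPERTS = 16
experts = [lambda x, s=i + 1: s * x for i in range(NUM_EXPERTS)]

random.seed(0)
tokens = [random.uniform(-1, 1) for _ in range(8)]
logits_per_token = [
    [random.gauss(0, 1) for _ in range(NUM_EXPERTS)] for _ in tokens
]

outputs = [moe_layer(t, l, experts) for t, l in zip(tokens, logits_per_token)]

# Count how many tokens each expert receives. With no capacity limit,
# every token is processed by exactly two experts, so nothing is dropped.
load = [0] * NUM_EXPERTS
for l in logits_per_token:
    for idx, _ in top2_route(l):
        load[idx] += 1
```

Because only two of the sixteen experts run per token, the compute per token tracks the activated parameters rather than the total parameter count, which is the sense in which the abstract compares a model with 6.6B activated parameters against 7B and 14B dense models.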
Code & Models
- 🤗 microsoft/Phi-mini-MoE-instruct (model): 100k downloads, 32 likes
- 🤗 microsoft/GRIN-MoE (model): 266 downloads, 200 likes
- 🤗 alexbuz/GRIN-MoE-2 (model): 4 downloads
- 🤗 alexbuz/GRIN-MoE (model): 1 download
- 🤗 microsoft/Phi-tiny-MoE-instruct (model): 548k downloads, 35 likes
- 🤗 gabriellarson/Phi-mini-MoE-instruct-GGUF (model): 2.7k downloads, 7 likes
- 🤗 FriendliAI/Phi-mini-MoE-instruct (model): 238 downloads, 1 like
- 🤗 FriendliAI/Phi-tiny-MoE-instruct (model): 32 downloads
Taxonomy
Topics: Domain Adaptation and Few-Shot Learning · Topic Modeling · Advanced Graph Neural Networks
Methods: Graph Recurrent Imputation Network · Mixture of Experts
