ST-MoE: Designing Stable and Transferable Sparse Expert Models
Barret Zoph, Irwan Bello, Sameer Kumar, Nan Du, Yanping Huang, Jeff Dean, Noam Shazeer, William Fedus

TL;DR
This paper introduces ST-MoE, a scalable and stable sparse expert model that achieves state-of-the-art transfer learning performance across diverse NLP tasks while maintaining computational efficiency.
Contribution
It provides a design guide to address training instability in MoE models and demonstrates scaling a sparse model to 269B parameters with high transferability.
Findings
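Achieved state-of-the-art transfer learning performance with a sparse model on tasks including reasoning (SuperGLUE, ARC Easy, ARC Challenge) and summarization (XSum, CNN-DM).
Scaled a sparse model to 269B parameters at a computational cost comparable to a 32B dense encoder-decoder Transformer.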
Addressed training stability issues in large sparse models.
Abstract
Scale has opened new frontiers in natural language processing -- but at a high cost. In response, Mixture-of-Experts (MoE) and Switch Transformers have been proposed as an energy efficient path to even larger and more capable language models. But advancing the state-of-the-art across a broad set of natural language tasks has been hindered by training instabilities and uncertain quality during fine-tuning. Our work focuses on these issues and acts as a design guide. We conclude by scaling a sparse model to 269B parameters, with a computational cost comparable to a 32B dense encoder-decoder Transformer (Stable and Transferable Mixture-of-Experts or ST-MoE-32B). For the first time, a sparse model achieves state-of-the-art performance in transfer learning, across a diverse set of tasks including reasoning (SuperGLUE, ARC Easy, ARC Challenge), summarization (XSum, CNN-DM), closed book…
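The abstract's contrast between total parameter count (269B) and computational cost (comparable to a 32B dense model) follows from sparse routing: each token is processed by only a few experts, so the weights stored grow with the number of experts while the weights used per token do not. The sketch below illustrates that arithmetic for a single Mixture-of-Experts feed-forward layer. The function name `moe_ffn_param_counts`, the dimensions, the expert count, and `top_k` are all hypothetical values chosen for illustration; they are not the configuration reported in the paper.

```python
def moe_ffn_param_counts(d_model, d_ff, num_experts, top_k):
    """Return (total, active_per_token) weight counts for one MoE feed-forward layer.

    Assumes each expert is a two-matrix feed-forward block (biases ignored)
    and the router is a single linear map from d_model to num_experts.
    """
    expert = 2 * d_model * d_ff          # d_model x d_ff plus d_ff x d_model
    router = d_model * num_experts       # one logit per expert
    total = num_experts * expert + router
    active = top_k * expert + router     # only top_k experts run for a given token
    return total, active


# Hypothetical layer sizes, chosen only to make the ratio easy to see.
total, active = moe_ffn_param_counts(d_model=1024, d_ff=4096, num_experts=32, top_k=2)
print(f"stored weights:          {total:,}")
print(f"weights used per token:  {active:,}")
print(f"ratio:                   {total / active:.1f}x")
```

With these made-up sizes the layer stores roughly 16 times more weights than any single token touches, which is the same kind of gap the abstract describes between parameter count and compute.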
Code & Models
- 🤗 chargoddard/MixtralRPChat-ZLoss (model) · 9 dl · ♡ 26
- 🤗 LoneStriker/MixtralRPChat-ZLoss-2.4bpw-h6-exl2 (model) · 1 dl
- 🤗 LoneStriker/MixtralRPChat-ZLoss-3.0bpw-h6-exl2 (model) · 4 dl · ♡ 1
- 🤗 LoneStriker/MixtralRPChat-ZLoss-3.5bpw-h6-exl2 (model) · 2 dl · ♡ 4
- 🤗 LoneStriker/MixtralRPChat-ZLoss-4.0bpw-h6-exl2 (model) · 2 dl
- 🤗 LoneStriker/MixtralRPChat-ZLoss-5.0bpw-h6-exl2 (model) · 4 dl
- 🤗 LoneStriker/MixtralRPChat-ZLoss-6.0bpw-h6-exl2 (model) · 3 dl · ♡ 1
- 🤗 TheBloke/MixtralRPChat-ZLoss-GGUF (model) · 150 dl · ♡ 9
- 🤗 TheBloke/MixtralRPChat-ZLoss-AWQ (model) · 3 dl · ♡ 2
- 🤗 TheBloke/MixtralRPChat-ZLoss-GPTQ (model) · 6 dl · ♡ 1
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Multimodal Machine Learning Applications
Methods: Mixture of Experts · Multi-Head Attention · Attention Is All You Need · Linear Layer · Label Smoothing · Byte Pair Encoding · Position-Wise Feed-Forward Layer · Adam · Dropout · Absolute Position Encodings
