ST-MoE: Designing Stable and Transferable Sparse Expert Models

Barret Zoph; Irwan Bello; Sameer Kumar; Nan Du; Yanping Huang; Jeff Dean; Noam Shazeer; William Fedus

arXiv:2202.08906 · cs.CL · May 3, 2022 · 48 cites

TL;DR

This paper introduces ST-MoE, a stable and transferable sparse expert model. Its largest variant, ST-MoE-32B, has 269B parameters at a computational cost comparable to a 32B dense encoder-decoder Transformer and achieves state-of-the-art transfer learning performance across diverse NLP tasks.

Contribution

The paper acts as a design guide for the training instabilities and uncertain fine-tuning quality that have hindered MoE models, and it concludes by scaling a sparse model to 269B parameters that transfers well to downstream tasks.

Findings

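01 Achieved state-of-the-art transfer learning performance on a diverse set of tasks, including reasoning (SuperGLUE, ARC Easy, ARC Challenge) and summarization (XSum, CNN-DM).

02 Scaled a sparse model to 269B parameters at a computational cost comparable to a 32B dense encoder-decoder Transformer.

03 Addressed the training instabilities and uncertain fine-tuning quality that had hindered large sparse models.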

Abstract

Scale has opened new frontiers in natural language processing -- but at a high cost. In response, Mixture-of-Experts (MoE) and Switch Transformers have been proposed as an energy efficient path to even larger and more capable language models. But advancing the state-of-the-art across a broad set of natural language tasks has been hindered by training instabilities and uncertain quality during fine-tuning. Our work focuses on these issues and acts as a design guide. We conclude by scaling a sparse model to 269B parameters, with a computational cost comparable to a 32B dense encoder-decoder Transformer (Stable and Transferable Mixture-of-Experts or ST-MoE-32B). For the first time, a sparse model achieves state-of-the-art performance in transfer learning, across a diverse set of tasks including reasoning (SuperGLUE, ARC Easy, ARC Challenge), summarization (XSum, CNN-DM), closed book…
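The gap between total parameters and per-token compute in a sparse model can be illustrated with a minimal sketch. It assumes a generic top-k router over feed-forward experts; the function names, the renormalized gate weights, and the toy dimensions are all hypothetical choices for illustration and are not taken from the paper's configuration.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def route_top_k(router_logits, k=2):
    """Pick the k most probable experts for one token.

    Returns [(expert_index, gate_weight)], with gate weights renormalized
    over the selected experts (an assumption of this sketch).
    """
    probs = softmax(router_logits)
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in top)
    return [(i, probs[i] / norm) for i in top]


def ffn_params(d_model, d_ff):
    """Weights in one feed-forward expert (two matrices, biases ignored)."""
    return 2 * d_model * d_ff


def moe_layer_params(d_model, d_ff, num_experts):
    """Total weights stored by the layer: every expert plus the router."""
    return num_experts * ffn_params(d_model, d_ff) + d_model * num_experts


def active_params_per_token(d_model, d_ff, num_experts, k):
    """Weights actually used for one token: k experts plus the router."""
    return k * ffn_params(d_model, d_ff) + d_model * num_experts


if __name__ == "__main__":
    # Toy sizes, chosen only to keep the arithmetic easy to follow.
    d_model, d_ff, num_experts, k = 4, 16, 8, 2
    print("selected experts:", route_top_k([0.0, 2.0, 1.0, -1.0], k=k))
    print("stored params:", moe_layer_params(d_model, d_ff, num_experts))
    print("active params:", active_params_per_token(d_model, d_ff, num_experts, k))
```

Adding experts increases the stored parameter count linearly, while the per-token cost depends mainly on `k`. This is the general reason a sparse model can hold far more parameters than a dense model of similar computational cost.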

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Multimodal Machine Learning Applications

Methods: Mixture of Experts · Multi-Head Attention · Attention Is All You Need · Linear Layer · Label Smoothing · Byte Pair Encoding · Position-Wise Feed-Forward Layer · Adam · Dropout · Absolute Position Encodings