Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity

William Fedus; Barret Zoph; Noam Shazeer

arXiv:2101.03961·cs.LG·June 20, 2022·361 cites

TL;DR

This paper introduces the Switch Transformer, a simplified sparse Mixture of Experts (MoE) model that lowers communication and computational costs, improves training stability, and makes it feasible to train language models with up to a trillion parameters.

Contribution

The paper presents a simplified routing algorithm for Mixture of Experts (MoE) models and training techniques that improve stability, and it shows that large sparse models, up to a trillion parameters, can be trained in a lower-precision format (bfloat16).
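To make the routing idea concrete, here is a minimal sketch in plain Python. It assumes the simplification is that each token is sent to the single highest-probability expert and that the expert output is scaled by that probability. The function names (`switch_route`, `switch_layer`), the weight layout, and the toy experts are hypothetical and chosen for illustration only; they are not the authors' implementation.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def switch_route(token, router_weights):
    """Pick one expert for a token (top-1 routing).

    token: list of d floats.
    router_weights: one row of d floats per expert (hypothetical layout).
    Returns (expert_index, gate_probability).
    """
    logits = [sum(w * x for w, x in zip(row, token)) for row in router_weights]
    probs = softmax(logits)
    expert = max(range(len(probs)), key=probs.__getitem__)
    return expert, probs[expert]


def switch_layer(tokens, router_weights, experts):
    """Run exactly one expert per token and scale its output by the gate."""
    outputs = []
    for token in tokens:
        expert, gate = switch_route(token, router_weights)
        expert_out = experts[expert](token)
        outputs.append([gate * v for v in expert_out])
    return outputs


# Toy usage: three "experts" that only rescale their input (illustrative).
random.seed(0)
d, num_experts = 4, 3
router_weights = [[random.gauss(0, 1) for _ in range(d)] for _ in range(num_experts)]
experts = [lambda x, s=s: [s * v for v in x] for s in (1.0, 2.0, 3.0)]
tokens = [[random.gauss(0, 1) for _ in range(d)] for _ in range(5)]
outputs = switch_layer(tokens, router_weights, experts)
```

Because only one expert runs per token, the per-token work in this sketch does not grow with the number of experts; only the router's dot products do. A real implementation would also need batching, expert capacity limits, and load balancing, none of which are shown here.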

Findings

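01  Up to 7x faster pre-training than the T5-Base and T5-Large baselines with the same computational resources.

02  Pre-training of models with up to a trillion parameters on the Colossal Clean Crawled Corpus (C4).

03  Gains over mT5-Base across all 101 languages in the multilingual setting.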

Abstract

In deep learning, models typically reuse the same parameters for all inputs. Mixture of Experts (MoE) defies this and instead selects different parameters for each incoming example. The result is a sparsely-activated model -- with outrageous numbers of parameters -- but a constant computational cost. However, despite several notable successes of MoE, widespread adoption has been hindered by complexity, communication costs and training instability -- we address these with the Switch Transformer. We simplify the MoE routing algorithm and design intuitive improved models with reduced communication and computational costs. Our proposed training techniques help wrangle the instabilities and we show large sparse models may be trained, for the first time, with lower precision (bfloat16) formats. We design models based off T5-Base and T5-Large to obtain up to 7x increases in pre-training speed…
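The claim that a sparsely-activated model has many parameters but a constant computational cost can be checked with simple arithmetic. The sketch below counts feed-forward weights under stated assumptions: each expert is a two-matrix feed-forward block, biases and the router's own weights are ignored, and each token uses one expert. The dimensions (768 and 3072) and the helper names are illustrative choices, not figures taken from the paper.

```python
def ffn_params(d_model, d_ff):
    """Weights in one two-matrix feed-forward block (biases ignored)."""
    return 2 * d_model * d_ff


def sparse_ffn_stats(d_model, d_ff, num_experts, experts_per_token=1):
    """Return (total stored weights, weights used per token) for one layer."""
    per_expert = ffn_params(d_model, d_ff)
    total = num_experts * per_expert
    active = experts_per_token * per_expert
    return total, active


# Stored weights grow with the expert count; weights touched per token do not.
for n in (1, 8, 64):
    total, active = sparse_ffn_stats(d_model=768, d_ff=3072, num_experts=n)
    print(f"experts={n:3d}  total={total:,}  active per token={active:,}")
```

Under these assumptions, the total weight count scales linearly with the number of experts while the active count per token stays fixed, which is the sense in which compute per token is constant.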

Code & Models

Videos

Sparse Expert Models (Switch Transformers, GLAM, and more... w/ the Authors) · YouTube

Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity · YouTube

OpenAI’s CLIP explained! | Examples, links to code and pretrained model · YouTube

Taxonomy

Topics: Topic Modeling · Domain Adaptation and Few-Shot Learning · Multimodal Machine Learning Applications

Methods: Switch FFN · Switch Transformer