Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity
William Fedus, Barret Zoph, Noam Shazeer

TL;DR
This paper introduces the Switch Transformer, a simplified sparse Mixture of Experts (MoE) model that makes trillion-parameter language models trainable with lower communication costs, better training stability, and faster pre-training.
Contribution
The paper presents a simplified routing algorithm for MoE models and training techniques that improve stability, and it shows that large sparse models can be trained in lower-precision (bfloat16) formats, scaling up to trillion-parameter models.
Findings
Up to 7x faster pre-training than the T5-Base and T5-Large baselines the models are built on.
Pre-training of models with up to a trillion parameters on the Colossal Clean Crawled Corpus.
Gains over mT5-Base across all 101 languages in the multilingual setting.
Abstract
In deep learning, models typically reuse the same parameters for all inputs. Mixture of Experts (MoE) defies this and instead selects different parameters for each incoming example. The result is a sparsely-activated model -- with outrageous numbers of parameters -- but a constant computational cost. However, despite several notable successes of MoE, widespread adoption has been hindered by complexity, communication costs and training instability -- we address these with the Switch Transformer. We simplify the MoE routing algorithm and design intuitive improved models with reduced communication and computational costs. Our proposed training techniques help wrangle the instabilities and we show large sparse models may be trained, for the first time, with lower precision (bfloat16) formats. We design models based off T5-Base and T5-Large to obtain up to 7x increases in pre-training speed…
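The idea in the abstract, that each input selects its own parameters while compute stays constant, can be illustrated with a small sketch. The code below is a toy illustration only, not the authors' implementation: it assumes the simplified routing means each token is sent to the single expert with the highest router probability, and that the expert output is scaled by that probability. All names (`switch_layer`, `make_expert`, `router_w`), the dimensions, and the random weights are hypothetical, and details such as expert capacity limits and load balancing are left out.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def matvec(x, w):
    """Multiply a length-n vector by an n x m matrix (list of rows)."""
    return [sum(x[i] * w[i][j] for i in range(len(x))) for j in range(len(w[0]))]


def make_expert(d_model, d_ff, rng):
    """Hypothetical toy expert: two weight matrices for a small feed-forward net."""
    w_in = [[rng.gauss(0, 0.1) for _ in range(d_ff)] for _ in range(d_model)]
    w_out = [[rng.gauss(0, 0.1) for _ in range(d_model)] for _ in range(d_ff)]
    return w_in, w_out


def expert_forward(x, expert):
    """Feed-forward pass with a ReLU between the two projections."""
    w_in, w_out = expert
    hidden = [max(0.0, h) for h in matvec(x, w_in)]
    return matvec(hidden, w_out)


def switch_layer(tokens, router_w, experts):
    """Route every token to exactly one expert (assumed top-1 routing)."""
    outputs, choices = [], []
    for x in tokens:
        probs = softmax(matvec(x, router_w))
        k = max(range(len(probs)), key=probs.__getitem__)  # index of best expert
        y = expert_forward(x, experts[k])                  # only one expert runs
        outputs.append([probs[k] * v for v in y])          # scale by router prob
        choices.append(k)
    return outputs, choices


rng = random.Random(0)
d_model, d_ff, n_experts = 4, 8, 4
router_w = [[rng.gauss(0, 1) for _ in range(n_experts)] for _ in range(d_model)]
experts = [make_expert(d_model, d_ff, rng) for _ in range(n_experts)]
tokens = [[rng.gauss(0, 1) for _ in range(d_model)] for _ in range(6)]

outputs, choices = switch_layer(tokens, router_w, experts)

# Expert parameters grow with the number of experts, but each token only
# touches the weights of the one expert it was routed to.
total_params = n_experts * 2 * d_model * d_ff
active_params = 2 * d_model * d_ff
print("expert chosen per token:", choices)
print("expert params total / used per token:", total_params, "/", active_params)
```

In this sketch, adding experts increases `total_params` linearly while `active_params` stays fixed, which is the sense in which a sparsely activated model can have a very large parameter count at roughly constant per-token cost.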
Code & Models
- 🤗 reichenbach/switch-transformer-classification · model · 25 dl · ♡ 2
- 🤗 google/switch-base-8 · model · 3.7k dl · ♡ 18
- 🤗 ybelkada/switch-base-8-xsum · model · 5 dl · ♡ 3
- 🤗 google/switch-base-16 · model · 332 dl · ♡ 4
- 🤗 google/switch-base-32 · model · 157 dl · ♡ 10
- 🤗 google/switch-base-64 · model · 82 dl · ♡ 3
- 🤗 google/switch-base-128 · model · 522 dl · ♡ 5
- 🤗 google/switch-base-256 · model · 37 dl · ♡ 4
- 🤗 google/switch-large-128 · model · 33 dl · ♡ 6
- 🤗 google/switch-xxl-128 · model · 15 dl · ♡ 12
Videos
Sparse Expert Models (Switch Transformers, GLAM, and more... w/ the Authors)· youtube
Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity· youtube
OpenAI’s CLIP explained! | Examples, links to code and pretrained model· youtube
Taxonomy
Topics: Topic Modeling · Domain Adaptation and Few-Shot Learning · Multimodal Machine Learning Applications
Methods: Switch FFN · Switch Transformer
