Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer
Noam Shazeer, Azalia Mirhoseini, Krzysztof Maziarz, Andy Davis, Quoc Le, Geoffrey Hinton, Jeff Dean

TL;DR
This paper introduces a sparsely-gated mixture-of-experts layer that enables neural networks to have over 1000 times more capacity with minimal efficiency loss, significantly improving language modeling and translation tasks.
Contribution
The paper presents a novel MoE layer that scales neural network capacity dramatically while maintaining computational efficiency, addressing key challenges in conditional computation.
Findings
Achieves a greater than 1000x increase in model capacity with only minor losses in computational efficiency.
Models with up to 137 billion parameters outperform the previous state of the art.
Reports significant improvements on language modeling and machine translation benchmarks.
Abstract
The capacity of a neural network to absorb information is limited by its number of parameters. Conditional computation, where parts of the network are active on a per-example basis, has been proposed in theory as a way of dramatically increasing model capacity without a proportional increase in computation. In practice, however, there are significant algorithmic and performance challenges. In this work, we address these challenges and finally realize the promise of conditional computation, achieving greater than 1000x improvements in model capacity with only minor losses in computational efficiency on modern GPU clusters. We introduce a Sparsely-Gated Mixture-of-Experts layer (MoE), consisting of up to thousands of feed-forward sub-networks. A trainable gating network determines a sparse combination of these experts to use for each example. We apply the MoE to the tasks of language…
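The gating idea in the abstract can be illustrated with a small sketch. This is a minimal, pure-Python illustration of one way to form a sparse combination of experts (keep the k highest gate scores, renormalize them with a softmax, and evaluate only those experts). It is not the authors' implementation. The names `top_k_gate`, `moe_forward`, and `make_linear_expert`, the linear experts, and the values of `dim`, `num_experts`, and `k` are all assumptions chosen for illustration, and the sketch omits any noise or load-balancing terms.

```python
import math
import random


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def make_linear_expert(dim, rng):
    """Hypothetical expert: a random dim x dim linear map (illustration only)."""
    weights = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(dim)]

    def expert(x):
        return [sum(w * xi for w, xi in zip(row, x)) for row in weights]

    return expert


def top_k_gate(x, gate_weights, k):
    """Return {expert_index: gate_weight} for the k experts with the largest logits."""
    logits = [sum(w * xi for w, xi in zip(row, x)) for row in gate_weights]
    chosen = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    # Softmax over the selected logits only; every other expert gets weight 0.
    weights = softmax([logits[i] for i in chosen])
    return dict(zip(chosen, weights))


def moe_forward(x, gate_weights, experts, k=2):
    """Weighted sum of the outputs of the k selected experts."""
    gates = top_k_gate(x, gate_weights, k)
    output = [0.0] * len(x)
    for index, gate in gates.items():
        expert_out = experts[index](x)  # unselected experts are never evaluated
        output = [acc + gate * val for acc, val in zip(output, expert_out)]
    return output, gates


# Assumed toy sizes: 8 experts over 4-dimensional inputs, 2 active per example.
rng = random.Random(0)
dim, num_experts = 4, 8
gate_weights = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(num_experts)]
experts = [make_linear_expert(dim, rng) for _ in range(num_experts)]

x = [0.5, -1.0, 0.25, 2.0]
y, gates = moe_forward(x, gate_weights, experts, k=2)
print("active experts:", sorted(gates))
print("gate weights sum to:", round(sum(gates.values()), 6))
```

In this sketch the total parameter count grows with `num_experts`, while the work per example grows only with `k`, which is the sense in which capacity can increase without a proportional increase in computation.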
Taxonomy
Topics: Advanced Neural Network Applications · Topic Modeling · Domain Adaptation and Few-Shot Learning
Methods: Sigmoid Activation · Tanh Activation · Long Short-Term Memory
