Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer

Noam Shazeer; Azalia Mirhoseini; Krzysztof Maziarz; Andy Davis; Quoc Le; Geoffrey Hinton; Jeff Dean

arXiv:1701.06538 · cs.LG · January 24, 2017 · 268 cites

TL;DR

This paper introduces a sparsely-gated mixture-of-experts (MoE) layer that increases neural network capacity by more than 1000x with only minor losses in computational efficiency, and reports significantly improved results on language modeling and translation tasks.

Contribution

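The paper presents a Sparsely-Gated Mixture-of-Experts (MoE) layer made up of up to thousands of feed-forward expert sub-networks, with a trainable gating network that selects a sparse combination of experts for each example. This scales model capacity by more than 1000x with only minor losses in computational efficiency, addressing the algorithmic and performance challenges that have limited conditional computation in practice.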

Findings

01 Achieved a more than 1000x increase in model capacity with only minor losses in computational efficiency.

02 Models with up to 137 billion parameters outperform the state of the art.

03 Significant improvements on language modeling and translation benchmarks.

Abstract

The capacity of a neural network to absorb information is limited by its number of parameters. Conditional computation, where parts of the network are active on a per-example basis, has been proposed in theory as a way of dramatically increasing model capacity without a proportional increase in computation. In practice, however, there are significant algorithmic and performance challenges. In this work, we address these challenges and finally realize the promise of conditional computation, achieving greater than 1000x improvements in model capacity with only minor losses in computational efficiency on modern GPU clusters. We introduce a Sparsely-Gated Mixture-of-Experts layer (MoE), consisting of up to thousands of feed-forward sub-networks. A trainable gating network determines a sparse combination of these experts to use for each example. We apply the MoE to the tasks of language…
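The gating idea described in the abstract can be illustrated with a minimal sketch in plain Python. It is not the authors' implementation: the abstract only says that a trainable gating network picks a sparse combination of experts, so the top-k selection with a softmax over the selected logits, the ReLU experts, and every name and size below (`make_expert`, `sparse_gate`, `moe_forward`, `N_EXPERTS`, `K`) are assumptions made for illustration. Weights are random rather than trained.

```python
import math
import random


def make_expert(d_in, d_hidden, d_out, rng):
    """Build a tiny one-hidden-layer feed-forward expert (hypothetical sizes)."""
    w1 = [[rng.gauss(0, 0.1) for _ in range(d_in)] for _ in range(d_hidden)]
    w2 = [[rng.gauss(0, 0.1) for _ in range(d_hidden)] for _ in range(d_out)]

    def expert(x):
        # Hidden layer with ReLU, then a linear output layer.
        h = [max(0.0, sum(w * xi for w, xi in zip(row, x))) for row in w1]
        return [sum(w * hi for w, hi in zip(row, h)) for row in w2]

    return expert


def sparse_gate(x, w_gate, k):
    """Return {expert_index: weight} for the k highest-scoring experts.

    Assumption: sparsity comes from keeping the top-k gate logits and
    applying a softmax over those k only; all other experts get weight 0.
    """
    logits = [sum(w * xi for w, xi in zip(row, x)) for row in w_gate]
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    m = max(logits[i] for i in top)  # subtract the max for numerical stability
    exps = {i: math.exp(logits[i] - m) for i in top}
    z = sum(exps.values())
    return {i: e / z for i, e in exps.items()}


def moe_forward(x, experts, w_gate, k=2):
    """Weighted sum of the outputs of the k selected experts only."""
    gates = sparse_gate(x, w_gate, k)
    y = None
    for i, g in gates.items():
        out = experts[i](x)  # unselected experts are never evaluated
        if y is None:
            y = [0.0] * len(out)
        y = [yi + g * oi for yi, oi in zip(y, out)]
    return y, gates


rng = random.Random(0)
D_IN, D_HIDDEN, D_OUT, N_EXPERTS, K = 4, 8, 3, 16, 2
experts = [make_expert(D_IN, D_HIDDEN, D_OUT, rng) for _ in range(N_EXPERTS)]
w_gate = [[rng.gauss(0, 1) for _ in range(D_IN)] for _ in range(N_EXPERTS)]

x = [0.5, -1.0, 0.25, 2.0]
y, gates = moe_forward(x, experts, w_gate, K)
print("selected experts and weights:", gates)
print("output:", y)
```

In this sketch only `K` of the `N_EXPERTS` experts run for a given input, so the parameter count grows with the number of experts while the per-example computation depends mainly on `K`. That is the capacity-versus-computation trade the abstract refers to, under the assumptions stated above.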

Taxonomy

Topics: Advanced Neural Network Applications · Topic Modeling · Domain Adaptation and Few-Shot Learning

Methods: Sigmoid Activation · Tanh Activation · Long Short-Term Memory