GPipe: Efficient Training of Giant Neural Networks using Pipeline Parallelism

Yanping Huang, Youlong Cheng, Ankur Bapna, Orhan Firat, Mia Xu Chen, Dehao Chen, HyoukJoong Lee, Jiquan Ngiam, Quoc V. Le, Yonghui Wu, Zhifeng Chen

arXiv:1811.06965 · cs.CV · July 29, 2019 · 236 cites

TL;DR

GPipe introduces a pipeline parallelism method enabling efficient training of extremely large neural networks across multiple accelerators, improving scalability and performance for diverse tasks.

Contribution

The paper presents GPipe, a pipeline parallelism library for any network that can be expressed as a sequence of layers. Its batch-splitting pipelining algorithm achieves almost linear speedup when a model is partitioned across multiple accelerators.

Findings

1. Trained a 557-million-parameter AmoebaNet model to 84.4% top-1 accuracy on ImageNet.

2. Trained a single 6-billion-parameter multilingual Transformer that outperforms bilingual models.

3. Demonstrated efficient scaling on two different architectures: a convolutional network (AmoebaNet) for image classification and a Transformer for multilingual translation.

Abstract

Scaling up deep neural network capacity has been known as an effective approach to improving model quality for several different machine learning tasks. In many cases, increasing model capacity beyond the memory limit of a single accelerator has required developing special algorithms or infrastructure. These solutions are often architecture-specific and do not transfer to other tasks. To address the need for efficient and task-independent model parallelism, we introduce GPipe, a pipeline parallelism library that allows scaling any network that can be expressed as a sequence of layers. By pipelining different sub-sequences of layers on separate accelerators, GPipe provides the flexibility of scaling a variety of different networks to gigantic sizes efficiently. Moreover, GPipe utilizes a novel batch-splitting pipelining algorithm, resulting in almost linear speedup when a model is…
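The batch-splitting idea in the abstract can be illustrated with a toy schedule. The sketch below is not GPipe's implementation: it assumes a forward pass only, stages that each take exactly one time step per micro-batch, and no communication cost. The function names (`forward_schedule`, `idle_fraction`, `speedup_over_sequential`) are hypothetical and chosen only for this illustration. It shows why splitting a mini-batch into many micro-batches keeps the accelerators busy: with K stages and M micro-batches, the idle share of the schedule in this model is (K - 1) / (M + K - 1), which shrinks as M grows.

```python
from fractions import Fraction


def forward_schedule(num_stages, num_micro_batches):
    """Toy pipeline schedule (illustrative assumption, not GPipe's code).

    Returns a list indexed by time step; each entry lists, per stage,
    the micro-batch index that stage processes, or None if it is idle.
    """
    total_steps = num_micro_batches + num_stages - 1
    schedule = []
    for t in range(total_steps):
        row = []
        for k in range(num_stages):
            # Stage k can work on micro-batch m only after stages 0..k-1
            # have finished it, so micro-batch m reaches stage k at step m + k.
            m = t - k
            row.append(m if 0 <= m < num_micro_batches else None)
        schedule.append(row)
    return schedule


def idle_fraction(schedule):
    """Share of (time step, stage) slots in which a stage has nothing to do."""
    slots = [cell for row in schedule for cell in row]
    return Fraction(slots.count(None), len(slots))


def speedup_over_sequential(num_stages, num_micro_batches):
    """Speedup versus running one stage at a time with no overlap."""
    sequential_steps = num_stages * num_micro_batches
    pipelined_steps = num_micro_batches + num_stages - 1
    return Fraction(sequential_steps, pipelined_steps)
```

Under these assumptions, 4 stages with 4 micro-batches leave 3/7 of the slots idle, while 4 stages with 32 micro-batches leave only 3/35 idle, for a speedup of 128/35 (about 3.66) on 4 devices. A real schedule also has to account for the backward pass, uneven stage costs, and transfer time between accelerators.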

Taxonomy

Topics: Domain Adaptation and Few-Shot Learning · Advanced Neural Network Applications · Multimodal Machine Learning Applications

Methods: Linear Layer · Absolute Position Encodings · Position-Wise Feed-Forward Layer · GPipe · Spatially Separable Convolution · Max Pooling · Convolution · Average Pooling · AmoebaNet · Residual Connection