TeraPipe: Token-Level Pipeline Parallelism for Training Large-Scale Language Models
Zhuohan Li, Siyuan Zhuang, Shiyuan Guo, Danyang Zhuo, Hao Zhang, Dawn Song, Ion Stoica

TL;DR
TeraPipe introduces token-level pipeline parallelism for training large Transformer-based language models. Because these models are autoregressive, a single training sequence can itself be pipelined, which gives a finer-grained pipeline than previous approaches and speeds up training of the 175-billion-parameter GPT-3 model by 5.0x.
Contribution
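The paper proposes a token-level pipeline parallelism method for synchronous model-parallel training of Transformer-based language models, together with a dynamic programming algorithm that computes the optimal pipelining execution scheme for a given model and cluster configuration.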
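To make the scheduling idea concrete, here is a minimal sketch of a dynamic program that splits one sequence into token slices. It rests on assumptions that this summary does not confirm: the pipeline latency is modeled as the sum of per-slice times plus (number of stages - 1) times the slowest slice, and `toy_slice_cost` is a made-up cost model. The names `best_slicing` and `toy_slice_cost` are hypothetical and not taken from the paper's code.

```python
import math


def toy_slice_cost(num_tokens, context_len):
    """Hypothetical per-stage time for one slice (an assumption, not measured data):
    a fixed launch overhead, a per-token term, and an attention term that grows
    with the number of earlier tokens the slice attends to."""
    return 1.0 + 0.1 * num_tokens + 0.01 * num_tokens * (context_len + num_tokens)


def best_slicing(seq_len, num_stages, slice_cost):
    """Return (latency, slice_lengths) minimizing the assumed latency model
    sum(t_i) + (num_stages - 1) * max(t_i) over all contiguous slicings."""
    # cost[(i, j)] is the time of a slice covering tokens i..j-1 with i tokens of context.
    cost = {
        (i, j): slice_cost(j - i, i)
        for i in range(seq_len)
        for j in range(i + 1, seq_len + 1)
    }
    best_latency, best_slices = math.inf, []
    # Enumerate a cap on the slowest slice, then minimize the sum under that cap.
    for t_max in sorted(set(cost.values())):
        dp = [math.inf] * (seq_len + 1)  # dp[j]: min total time to cover the first j tokens
        prev = [-1] * (seq_len + 1)      # prev[j]: start index of the last slice ending at j
        dp[0] = 0.0
        for j in range(1, seq_len + 1):
            for i in range(j):
                t = cost[(i, j)]
                if t <= t_max and dp[i] + t < dp[j]:
                    dp[j], prev[j] = dp[i] + t, i
        if math.isinf(dp[seq_len]):
            continue  # no slicing keeps every slice under this cap
        latency = dp[seq_len] + (num_stages - 1) * t_max
        if latency < best_latency:
            slices, j = [], seq_len
            while j > 0:  # walk the back-pointers to recover the slice lengths
                slices.append(j - prev[j])
                j = prev[j]
            best_latency, best_slices = latency, slices[::-1]
    return best_latency, best_slices
```

Calling `best_slicing(12, 4, toy_slice_cost)` returns the modeled latency and a list of slice lengths that sums to 12. With a cost model whose attention term grows with context, later slices would be expected to come out shorter than earlier ones, but that depends entirely on the assumed costs.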
Findings
Achieves a 5.0x training speedup for the largest GPT-3 model (175 billion parameters) on an AWS cluster
Enables fine-grained pipeline parallelism within sequences
Demonstrates effectiveness on large-scale models
Abstract
Model parallelism has become a necessity for training modern large-scale deep language models. In this work, we identify a new and orthogonal dimension from existing model parallel approaches: it is possible to perform pipeline parallelism within a single training sequence for Transformer-based language models thanks to its autoregressive property. This enables a more fine-grained pipeline compared with previous work. With this key idea, we design TeraPipe, a high-performance token-level pipeline parallel algorithm for synchronous model-parallel training of Transformer-based language models. We develop a novel dynamic programming-based algorithm to calculate the optimal pipelining execution scheme given a specific model and cluster configuration. We show that TeraPipe can speed up the training by 5.0x for the largest GPT-3 model with 175 billion parameters on an AWS cluster with 48…
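The autoregressive property that the abstract relies on can be illustrated with a small sketch. In causal self-attention, the output at position t depends only on positions up to t, so a sequence can be processed slice by slice, left to right, with the same result as processing it whole. The code below is a pure-Python, single-head toy under that assumption; the function names are hypothetical and it is not the TeraPipe implementation.

```python
import math
import random


def attend(query, keys, values):
    """Softmax attention of a single query vector over the given keys and values."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    top = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) / total for d in range(dim)]


def causal_attention(q, k, v):
    """Whole-sequence causal attention: token t sees tokens 0..t only."""
    return [attend(q[t], k[: t + 1], v[: t + 1]) for t in range(len(q))]


def sliced_causal_attention(q, k, v, slice_lens):
    """Process the sequence in consecutive token slices. Each slice only needs
    the cached keys/values of earlier slices plus its own, so a pipeline stage
    could start on slice i before later slices have arrived."""
    k_cache, v_cache, out = [], [], []
    start = 0
    for n in slice_lens:
        k_cache.extend(k[start : start + n])
        v_cache.extend(v[start : start + n])
        for t in range(start, start + n):
            out.append(attend(q[t], k_cache[: t + 1], v_cache[: t + 1]))
        start += n
    return out


def random_matrix(rows, cols, rng):
    """Helper for building toy inputs."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]
```

For example, with `rng = random.Random(0)` and three 8-by-4 matrices from `random_matrix`, `sliced_causal_attention(q, k, v, [3, 1, 4])` matches `causal_attention(q, k, v)` row for row. This independence from later tokens is what allows different pipeline stages to work on different slices of the same sequence at once.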
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Speech Recognition and Synthesis
Methods: Linear Layer · Cosine Annealing · Attention Dropout · Layer Normalization · Dense Connections · Adam · Linear Warmup With Cosine Annealing
