Efficient Large-Scale Language Model Training on GPU Clusters Using Megatron-LM

Deepak Narayanan; Mohammad Shoeybi; Jared Casper; Patrick LeGresley; Mostofa Patwary; Vijay Anand Korthikanti; Dmitri Vainbrand; Prethvi Kashinkunti; Julie Bernauer; Bryan Catanzaro; Amar Phanishayee; Matei Zaharia

arXiv:2104.04473·cs.CL·August 25, 2021

TL;DR

This paper presents Megatron-LM, a scalable approach combining tensor, pipeline, and data parallelism to efficiently train trillion-parameter language models on GPU clusters, significantly improving throughput and scalability.

Contribution

It introduces a novel interleaved pipeline parallelism schedule and demonstrates scalable training of trillion-parameter models on thousands of GPUs.
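
To make the benefit of interleaving concrete, here is a minimal sketch of the idealized pipeline-bubble estimate. It assumes the commonly used model in which a pipeline with p stages and m microbatches per batch idles for a fraction (p - 1) / m of the ideal compute time, and that giving each device v model chunks divides that fraction by v. The function name and the example values are illustrative assumptions, not figures from this page.

```python
def bubble_fraction(num_stages: int, num_microbatches: int,
                    chunks_per_device: int = 1) -> float:
    """Idealized pipeline bubble time as a fraction of ideal compute time.

    Assumes uniform per-microbatch compute time and ignores communication
    cost, which interleaving increases in practice.
    """
    if min(num_stages, num_microbatches, chunks_per_device) < 1:
        raise ValueError("all arguments must be positive integers")
    return (num_stages - 1) / (chunks_per_device * num_microbatches)


# Hypothetical configuration: 8 pipeline stages, 32 microbatches per batch.
plain = bubble_fraction(8, 32)
interleaved = bubble_fraction(8, 32, chunks_per_device=2)
print(f"non-interleaved bubble: {plain:.4f}")
print(f"interleaved (v=2) bubble: {interleaved:.4f}")
```

Under these assumptions, doubling the chunks per device halves the estimated idle fraction; the extra communication that interleaving requires is not modeled here.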

Findings

01

Achieved 502 petaFLOP/s on 3072 GPUs.

02

Trained a 1-trillion-parameter model at a per-GPU throughput of 52% of theoretical peak.

03

Improved throughput by over 10% with the interleaved pipeline schedule, at a memory footprint comparable to existing approaches.
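
The aggregate and per-GPU figures above can be cross-checked with a little arithmetic. The sketch below assumes a per-GPU peak of 312 teraFLOP/s (the dense FP16 tensor-core peak of an NVIDIA A100); that hardware figure is an assumption and is not stated on this page.

```python
AGGREGATE_FLOPS = 502e15        # 502 petaFLOP/s, from the findings
NUM_GPUS = 3072                 # from the findings
ASSUMED_PEAK_PER_GPU = 312e12   # assumption: A100 dense FP16 peak, FLOP/s

# Average sustained throughput of a single GPU.
per_gpu_flops = AGGREGATE_FLOPS / NUM_GPUS

# Fraction of the assumed hardware peak that is actually achieved.
utilization = per_gpu_flops / ASSUMED_PEAK_PER_GPU

print(f"{per_gpu_flops / 1e12:.1f} teraFLOP/s per GPU")
print(f"{utilization:.1%} of assumed peak")
```

With that assumed peak, the result is roughly 163 teraFLOP/s per GPU, or about 52% utilization, which is consistent with the second finding.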

Abstract

Large language models have led to state-of-the-art accuracies across a range of tasks. However, training these models efficiently is challenging for two reasons: a) GPU memory capacity is limited, making it impossible to fit large models on even a multi-GPU server, and b) the number of compute operations required to train these models can result in unrealistically long training times. Consequently, new methods of model parallelism such as tensor and pipeline parallelism have been proposed. Unfortunately, naive usage of these methods leads to fundamental scaling issues at thousands of GPUs, e.g., due to expensive cross-node communication or devices spending significant time waiting on other devices to make progress. In this paper, we show how different types of parallelism methods (tensor, pipeline, and data parallelism) can be composed to scale to thousands of GPUs and models with…
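
As a rough illustration of how the three kinds of parallelism compose, the sketch below maps a global GPU rank to a (data, pipeline, tensor) coordinate. The rank ordering, the names, and the group sizes are hypothetical choices made for this example and are not Megatron-LM's actual layout. The sizes 8 x 64 x 6 were picked only so that their product matches the 3072 GPUs quoted above.

```python
from typing import NamedTuple


class Coord(NamedTuple):
    data: int      # index of the data-parallel replica
    pipeline: int  # index of the pipeline stage
    tensor: int    # index within the tensor-parallel group


def rank_to_coord(rank: int, tensor_size: int, pipeline_size: int,
                  data_size: int) -> Coord:
    """Decompose a global rank, with the tensor index varying fastest.

    Letting the tensor index vary fastest places each tensor-parallel group
    on consecutive ranks, which typically share a server and its fast
    intra-node links.
    """
    world_size = tensor_size * pipeline_size * data_size
    if not 0 <= rank < world_size:
        raise ValueError(f"rank must be in [0, {world_size})")
    tensor = rank % tensor_size
    pipeline = (rank // tensor_size) % pipeline_size
    data = rank // (tensor_size * pipeline_size)
    return Coord(data, pipeline, tensor)


# Hypothetical sizes: 8-way tensor, 64-way pipeline, 6-way data parallelism.
T, P, D = 8, 64, 6
print(T * P * D)                    # total number of GPUs in this layout
print(rank_to_coord(0, T, P, D))
print(rank_to_coord(3071, T, P, D))
```

The total GPU count is the product of the three sizes, so raising one degree of parallelism on a fixed cluster means lowering another.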

Code & Models

Repositories

Models

🤗
theonlyengine/flash-attention
model · 5 dl