Efficient Large-Scale Language Model Training on GPU Clusters Using Megatron-LM
Deepak Narayanan, Mohammad Shoeybi, Jared Casper, Patrick LeGresley, Mostofa Patwary, Vijay Anand Korthikanti, Dmitri Vainbrand, Prethvi Kashinkunti, Julie Bernauer, Bryan Catanzaro, Amar Phanishayee, Matei Zaharia

TL;DR
This paper presents Megatron-LM, a scalable approach combining tensor, pipeline, and data parallelism to efficiently train trillion-parameter language models on GPU clusters, significantly improving throughput and scalability.
Contribution
It introduces a novel interleaved pipeline parallelism schedule and demonstrates scalable training of trillion-parameter models on thousands of GPUs.
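To give a rough sense of why an interleaved schedule can help, the sketch below uses a common idle-time model for pipeline parallelism. This model is an assumption here, since this summary does not state it: with p pipeline stages and m microbatches per batch, the pipeline "bubble" (time spent filling and draining) is about (p - 1) / m of the ideal compute time, and giving each device v smaller model chunks instead of one contiguous block shrinks it by a factor of v. The function and parameter names are illustrative only.

```python
def bubble_fraction(pipeline_stages: int, microbatches: int,
                    chunks_per_device: int = 1) -> float:
    """Pipeline bubble time as a fraction of ideal compute time.

    Assumed model (not taken from this summary): (p - 1) / (v * m), where
    p = pipeline stages, m = microbatches per batch, v = model chunks per
    device. v = 1 corresponds to a non-interleaved schedule.
    """
    if min(pipeline_stages, microbatches, chunks_per_device) < 1:
        raise ValueError("all arguments must be positive integers")
    return (pipeline_stages - 1) / (chunks_per_device * microbatches)


# Hypothetical configuration: 8 stages, 32 microbatches.
plain = bubble_fraction(8, 32)            # one chunk per device
interleaved = bubble_fraction(8, 32, 2)   # two chunks per device
print(f"non-interleaved bubble: {plain:.1%}")
print(f"interleaved bubble:     {interleaved:.1%}")
```

Under this model, doubling the chunks per device halves the bubble. The trade-off is extra communication between stages, so the gain seen in practice depends on the interconnect and batch size.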
Findings
Achieved 502 petaFLOP/s on 3072 GPUs.
Trained a 1-trillion-parameter model at a per-GPU throughput of 52% of theoretical peak.
The interleaved pipeline schedule improved throughput by over 10% with a memory footprint comparable to existing approaches.
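The aggregate throughput and the percent-of-peak figure can be cross-checked with simple arithmetic. The sketch below divides the aggregate figure by the GPU count and compares the result with a per-GPU peak. The peak value is an assumption not stated in this summary: it takes each GPU to be an NVIDIA A100 with a half-precision dense peak of 312 teraFLOP/s.

```python
# Figures quoted in the findings above.
AGGREGATE_PFLOPS = 502   # petaFLOP/s across the whole cluster
NUM_GPUS = 3072

# Assumption (not stated in this summary): NVIDIA A100 GPUs with a
# half-precision dense peak of 312 teraFLOP/s each.
ASSUMED_PEAK_TFLOPS_PER_GPU = 312


def per_gpu_tflops(aggregate_pflops: float, num_gpus: int) -> float:
    """Average achieved teraFLOP/s per GPU (1 petaFLOP/s = 1000 teraFLOP/s)."""
    return aggregate_pflops * 1000 / num_gpus


def fraction_of_peak(achieved_tflops: float, peak_tflops: float) -> float:
    """Achieved throughput as a fraction of the assumed hardware peak."""
    return achieved_tflops / peak_tflops


achieved = per_gpu_tflops(AGGREGATE_PFLOPS, NUM_GPUS)
utilization = fraction_of_peak(achieved, ASSUMED_PEAK_TFLOPS_PER_GPU)
print(f"{achieved:.1f} teraFLOP/s per GPU, {utilization:.0%} of assumed peak")
```

With these inputs the script reports roughly 163 teraFLOP/s per GPU, which is about 52% of the assumed peak and so is consistent with the two figures quoted above.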
Abstract
Large language models have led to state-of-the-art accuracies across a range of tasks. However, training these models efficiently is challenging for two reasons: a) GPU memory capacity is limited, making it impossible to fit large models on even a multi-GPU server, and b) the number of compute operations required to train these models can result in unrealistically long training times. Consequently, new methods of model parallelism such as tensor and pipeline parallelism have been proposed. Unfortunately, naive usage of these methods leads to fundamental scaling issues at thousands of GPUs, e.g., due to expensive cross-node communication or devices spending significant time waiting on other devices to make progress. In this paper, we show how different types of parallelism methods (tensor, pipeline, and data parallelism) can be composed to scale to thousands of GPUs and models with…
