Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism
Mohammad Shoeybi, Mostofa Patwary, Raul Puri, Patrick LeGresley, Jared Casper, and Bryan Catanzaro

TL;DR
This paper introduces a simple intra-layer model parallel technique for training billion-parameter transformer models efficiently on GPUs, achieving state-of-the-art results in NLP tasks.
Contribution
It presents a novel intra-layer model parallel approach that enables training extremely large transformers without extensive infrastructure changes.
Findings
Successfully trained 8.3 billion parameter models on 512 GPUs
Achieved 15.1 PetaFLOPs with 76% scaling efficiency
Set new state-of-the-art results on multiple NLP benchmarks
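The throughput figures above can be unpacked with a little arithmetic. The sketch below is a back-of-the-envelope check, not a number reported on this page: it assumes that "scaling efficiency" means aggregate throughput divided by (number of GPUs × single-GPU throughput), and the variable names are illustrative only.

```python
# Figures quoted in the Findings list above.
total_pflops = 15.1          # sustained aggregate throughput, in PetaFLOPs
num_gpus = 512
scaling_efficiency = 0.76

# Average sustained throughput per GPU, in TeraFLOPs (1 PFLOP = 1000 TFLOPs).
per_gpu_tflops = total_pflops * 1000 / num_gpus

# ASSUMPTION: efficiency = aggregate / (num_gpus * single_gpu_baseline).
# Under that definition, the single-GPU baseline implied by the figures is:
implied_baseline_tflops = per_gpu_tflops / scaling_efficiency

print(f"per-GPU throughput: {per_gpu_tflops:.1f} TFLOPs")
print(f"implied single-GPU baseline: {implied_baseline_tflops:.1f} TFLOPs")
```

Under that assumed definition, each GPU sustains roughly 29.5 TeraFLOPs at scale, which implies a single-GPU baseline of roughly 38.8 TeraFLOPs.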
Abstract
Recent work in language modeling demonstrates that training large transformer models advances the state of the art in Natural Language Processing applications. However, very large models can be quite difficult to train due to memory constraints. In this work, we present our techniques for training very large transformer models and implement a simple, efficient intra-layer model parallel approach that enables training transformer models with billions of parameters. Our approach does not require a new compiler or library changes, is orthogonal and complementary to pipeline model parallelism, and can be fully implemented with the insertion of a few communication operations in native PyTorch. We illustrate this approach by converging transformer-based models up to 8.3 billion parameters using 512 GPUs. We sustain 15.1 PetaFLOPs across the entire application with 76% scaling efficiency when…
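To make the idea of intra-layer model parallelism concrete, here is a minimal single-process sketch of a transformer MLP block split across simulated devices. The abstract does not spell out the partitioning, so the sketch assumes the scheme described in the full paper: the first weight matrix is split by columns, the second by rows, and the partial outputs are summed (the role an all-reduce plays across real GPUs). NumPy stands in for PyTorch, and the function names and sizes are illustrative, not taken from the Megatron-LM codebase.

```python
import numpy as np

def gelu(x):
    # Tanh approximation of the GeLU nonlinearity.
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

def mlp_reference(x, a, b):
    # Unpartitioned MLP block: Y = GeLU(X A) B.
    return gelu(x @ a) @ b

def mlp_tensor_parallel(x, a, b, world_size):
    # ASSUMPTION: column split of A, row split of B, one shard pair per "device".
    a_shards = np.split(a, world_size, axis=1)   # each shard: (hidden, 4*hidden / world_size)
    b_shards = np.split(b, world_size, axis=0)   # each shard: (4*hidden / world_size, hidden)
    # GeLU is elementwise, so it can be applied to each column shard independently.
    partials = [gelu(x @ a_i) @ b_i for a_i, b_i in zip(a_shards, b_shards)]
    # Summing the partial outputs stands in for an all-reduce across devices.
    return sum(partials)

rng = np.random.default_rng(0)
hidden, world_size = 8, 4
x = rng.standard_normal((2, hidden))
a = rng.standard_normal((hidden, 4 * hidden))
b = rng.standard_normal((4 * hidden, hidden))

y_ref = mlp_reference(x, a, b)
y_tp = mlp_tensor_parallel(x, a, b, world_size)
print("outputs match:", np.allclose(y_ref, y_tp))
```

The column split is what lets the nonlinearity run without communication: each simulated device holds complete columns of the intermediate activation, so only the final sum requires a collective operation.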
Code & Models
- 🤗 bigscience/bloom · model · 7.4k dl · ♡ 4989
- 🤗 nvidia/megatron-bert-cased-345m · model · ♡ 4
- 🤗 nvidia/megatron-bert-uncased-345m · model · ♡ 7
- 🤗 nvidia/megatron-gpt2-345m · model · ♡ 25
- 🤗 bigscience/bloom-560m · model · 192k dl · ♡ 371
- 🤗 bigscience/bloom-1b1 · model · 6.6k dl · ♡ 66
- 🤗 bigscience/bloom-1b7 · model · 55k dl · ♡ 122
- 🤗 bigscience/bloom-3b · model · 10k dl · ♡ 94
- 🤗 bigscience/bloom-7b1 · model · 11k dl · ♡ 202
- 🤗 bigscience/bloom-intermediate · model · 12 dl · ♡ 12
Videos
Ultimate Guide To Scaling ML Models - Megatron-LM | ZeRO | DeepSpeed | Mixed Precision· youtube
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Speech Recognition and Synthesis
Methods: Linear Layer · Absolute Position Encodings · Position-Wise Feed-Forward Layer · Cosine Annealing · Transformer · Residual Connection · Attention Dropout · Linear Warmup With Linear Decay · Linear Warmup With Cosine Annealing · Discriminative Fine-Tuning
