Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism

Mohammad Shoeybi, Mostofa Patwary, Raul Puri, Patrick LeGresley, Jared Casper, and Bryan Catanzaro

arXiv:1909.08053 · cs.CL · March 17, 2020 · 824 citations

TL;DR

This paper introduces a simple intra-layer model parallel technique for efficiently training transformer models with billions of parameters on GPUs, achieving state-of-the-art results on NLP tasks.

Contribution

The paper presents an intra-layer model parallel approach that enables training transformers with billions of parameters without a new compiler or library changes; it needs only a few communication operations inserted into native PyTorch.

Findings

01 Trained transformer models of up to 8.3 billion parameters to convergence on 512 GPUs

02 Sustained 15.1 PetaFLOPs across the entire application with 76% scaling efficiency

03 Set new state-of-the-art results on multiple NLP benchmarks
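The throughput finding can be unpacked with a little arithmetic. The sketch below assumes that "PetaFLOPs" denotes a sustained rate (floating-point operations per second) and that scaling efficiency means per-GPU throughput at 512 GPUs divided by the throughput of a single-GPU baseline; that definition is an assumption made for illustration, and the function names are hypothetical.

```python
def per_gpu_throughput_tflops(total_pflops: float, num_gpus: int) -> float:
    """Average sustained throughput per GPU, in TeraFLOPs."""
    return total_pflops * 1000.0 / num_gpus


def implied_baseline_tflops(per_gpu_tflops: float, efficiency: float) -> float:
    """Single-GPU baseline implied by the assumed efficiency definition."""
    return per_gpu_tflops / efficiency


# Figures quoted in the findings above.
TOTAL_PFLOPS = 15.1
NUM_GPUS = 512
SCALING_EFFICIENCY = 0.76

per_gpu = per_gpu_throughput_tflops(TOTAL_PFLOPS, NUM_GPUS)
baseline = implied_baseline_tflops(per_gpu, SCALING_EFFICIENCY)

print(f"per-GPU throughput: {per_gpu:.1f} TFLOPs")            # 29.5
print(f"implied single-GPU baseline: {baseline:.1f} TFLOPs")  # 38.8
```

Under these assumptions each GPU sustains roughly 29.5 TeraFLOPs, which corresponds to a single-GPU baseline of roughly 38.8 TeraFLOPs.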

Abstract

Recent work in language modeling demonstrates that training large transformer models advances the state of the art in Natural Language Processing applications. However, very large models can be quite difficult to train due to memory constraints. In this work, we present our techniques for training very large transformer models and implement a simple, efficient intra-layer model parallel approach that enables training transformer models with billions of parameters. Our approach does not require a new compiler or library changes, is orthogonal and complementary to pipeline model parallelism, and can be fully implemented with the insertion of a few communication operations in native PyTorch. We illustrate this approach by converging transformer-based models up to 8.3 billion parameters using 512 GPUs. We sustain 15.1 PetaFLOPs across the entire application with 76% scaling efficiency when…
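To make "intra-layer model parallelism" concrete, here is a minimal sketch of splitting one two-layer MLP block across workers. It is simulated in a single process with plain Python lists rather than PyTorch and real GPUs, so every helper name is hypothetical, and `all_reduce_sum` only stands in for the communication operation a distributed run would need. The split shown (first weight matrix by columns, second by rows) is an assumption chosen for illustration: it lets each worker apply the nonlinearity to its own shard, so a single summation at the end recovers the unsplit result.

```python
import math


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def gelu(m):
    """Elementwise GeLU (tanh approximation)."""
    c = math.sqrt(2.0 / math.pi)
    return [[0.5 * v * (1.0 + math.tanh(c * (v + 0.044715 * v ** 3))) for v in row]
            for row in m]


def split_columns(m, parts):
    """Split a matrix into `parts` equal-width column blocks."""
    width = len(m[0]) // parts
    return [[row[i * width:(i + 1) * width] for row in m] for i in range(parts)]


def split_rows(m, parts):
    """Split a matrix into `parts` equal-height row blocks."""
    height = len(m) // parts
    return [m[i * height:(i + 1) * height] for i in range(parts)]


def all_reduce_sum(partials):
    """Stand-in for an all-reduce: elementwise sum of per-worker outputs."""
    return [[sum(vals) for vals in zip(*rows)] for rows in zip(*partials)]


def mlp_serial(x, a, b):
    """Reference computation on one device: GeLU(x @ a) @ b."""
    return matmul(gelu(matmul(x, a)), b)


def mlp_parallel(x, a, b, workers):
    """Same computation with `a` split by columns and `b` split by rows."""
    a_shards = split_columns(a, workers)
    b_shards = split_rows(b, workers)
    # Each worker needs only its own shards; no communication until the end.
    partials = [matmul(gelu(matmul(x, a_i)), b_i)
                for a_i, b_i in zip(a_shards, b_shards)]
    return all_reduce_sum(partials)


# Small deterministic example: batch 2, hidden size 3, inner size 4.
x = [[0.5, -1.0, 2.0], [1.5, 0.25, -0.75]]
a = [[0.1 * (i + j) - 0.3 for j in range(4)] for i in range(3)]
b = [[0.2 * (i - j) for j in range(3)] for i in range(4)]

serial = mlp_serial(x, a, b)
parallel = mlp_parallel(x, a, b, workers=2)
max_diff = max(abs(s - p) for rs, rp in zip(serial, parallel) for s, p in zip(rs, rp))
print(f"max abs difference between serial and parallel: {max_diff:.2e}")
```

The two results agree up to floating-point rounding. In a real multi-GPU setting the final sum would be a collective communication call, and the backward pass would need its own communication step, which this sketch does not model.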

Code & Models

Videos

Ultimate Guide To Scaling ML Models - Megatron-LM | ZeRO | DeepSpeed | Mixed Precision · YouTube

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Speech Recognition and Synthesis

Methods: Linear Layer · Absolute Position Encodings · Position-Wise Feed-Forward Layer · Cosine Annealing · Transformer · Residual Connection · Attention Dropout · Linear Warmup With Linear Decay · Linear Warmup With Cosine Annealing · Discriminative Fine-Tuning