Using DeepSpeed and Megatron to Train Megatron-Turing NLG 530B, A Large-Scale Generative Language Model
Shaden Smith, Mostofa Patwary, Brandon Norick, Patrick LeGresley, Samyam Rajbhandari, Jared Casper, Zhun Liu, Shrimai Prabhumoye, George Zerveas, Vijay Korthikanti, Elton Zhang, Rewon Child, Reza Yazdani Aminabadi, Julie Bernauer, Xia Song, Mohammad Shoeybi, Yuxiong He

TL;DR
This paper details the training of the largest monolithic transformer language model, MT-NLG 530B, using advanced parallelism and data curation techniques, achieving state-of-the-art results in NLP benchmarks.
Contribution
It describes the training infrastructure and 3D parallelism methodology, built on DeepSpeed and Megatron, used to train a 530-billion-parameter language model, together with the design of the training corpus and the data curation techniques.
Findings
Achieved state-of-the-art zero-, one-, and few-shot NLP performance.
Developed a scalable 3D parallelism training methodology.
Provided insights into data curation for large-scale models.
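To make the "3D parallelism" finding concrete, the following is a minimal sketch of how a pool of GPU ranks can be arranged on a three-axis grid (data, pipeline, tensor) and how the communication groups fall out of that grid. The function names, the rank ordering (tensor axis varying fastest), and the small degrees used in the usage note are illustrative assumptions, not the configuration or code used by DeepSpeed, Megatron, or the paper.

```python
from itertools import product


def build_3d_grid(data_parallel, pipeline_parallel, tensor_parallel):
    """Map each global rank to (data, pipeline, tensor) coordinates.

    Assumption: the tensor axis varies fastest, so ranks in the same
    tensor-parallel group are numbered consecutively.
    """
    axes = product(
        range(data_parallel), range(pipeline_parallel), range(tensor_parallel)
    )
    return {rank: coord for rank, coord in enumerate(axes)}


def tensor_group(grid, rank):
    """Ranks that jointly hold the shards of the same layers."""
    d, p, _ = grid[rank]
    return sorted(r for r, (dd, pp, _) in grid.items() if (dd, pp) == (d, p))


def pipeline_group(grid, rank):
    """Ranks that hold successive pipeline stages of one model replica."""
    d, _, t = grid[rank]
    return sorted(r for r, (dd, _, tt) in grid.items() if (dd, tt) == (d, t))


def data_group(grid, rank):
    """Ranks that hold the same model shard in different replicas."""
    _, p, t = grid[rank]
    return sorted(r for r, (_, pp, tt) in grid.items() if (pp, tt) == (p, t))


if __name__ == "__main__":
    # Hypothetical degrees chosen only to keep the output small.
    grid = build_3d_grid(data_parallel=2, pipeline_parallel=3, tensor_parallel=4)
    print("world size:", len(grid))
    print("rank 5 coordinates:", grid[5])
    print("tensor group:", tensor_group(grid, 5))
    print("pipeline group:", pipeline_group(grid, 5))
    print("data group:", data_group(grid, 5))
```

With these hypothetical degrees the world size is 2 × 3 × 4 = 24 ranks. Each rank belongs to exactly one group per axis, which is what lets the three forms of parallelism be composed independently.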
Abstract
Pretrained general-purpose language models can achieve state-of-the-art accuracies in various natural language processing domains by adapting to downstream tasks via zero-shot, few-shot and fine-tuning techniques. Because of their success, the size of these models has increased rapidly, requiring high-performance hardware, software, and algorithmic techniques to enable training such large models. As the result of a joint effort between Microsoft and NVIDIA, we present details on the training of the largest monolithic transformer based language model, Megatron-Turing NLG 530B (MT-NLG), with 530 billion parameters. In this paper, we first focus on the infrastructure as well as the 3D parallelism methodology used to train this model using DeepSpeed and Megatron. Next, we detail the training process, the design of our training corpus, and our data curation techniques, which we believe is a…
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Multimodal Machine Learning Applications
