MegaScale: Scaling Large Language Model Training to More Than 10,000 GPUs

Ziheng Jiang; Haibin Lin; Yinmin Zhong; Qi Huang; Yangrui Chen; Zhi Zhang; Yanghua Peng; Xiang Li; Cong Xie; Shibiao Nong; Yulu Jia; Sun He; Hongmin Chen; Zhihao Bai; Qi Hou; Shipeng Yan; Ding Zhou; Yiyao Sheng; Zhuo Jiang; Haohan Xu; Haoran Wei; Zhang Zhang; Pengfei Nie; Leqi Zou; Sida Zhao; Liang Xiang; Zherui Liu; Zhe Li; Xiaoying Jia; Jianxi Ye; Xin Jin; Xin Liu

arXiv:2402.15627 · cs.LG · February 27, 2024 · 24 cites

TL;DR

MegaScale demonstrates a full-stack system design for training large language models on over 10,000 GPUs, achieving high efficiency and stability through innovative system and algorithm co-design, diagnostics, and fault tolerance.

Contribution

This work introduces MegaScale, a scalable, efficient, and stable system for training LLMs at unprecedented GPU scale, with novel diagnostics and optimization techniques.

Findings

01 Achieved 55.2% Model FLOPs Utilization (MFU) on 12,288 GPUs.

02 Improved MFU by 1.34x over Megatron-LM.

03 Developed diagnostic tools for stability and fault tolerance.
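
The two MFU figures above can be related with some simple arithmetic. The sketch below is illustrative only: it assumes the common approximation of 6 FLOPs per parameter per token for training (ignoring attention terms), and the function names, the model size, and the per-GPU peak throughput are hypothetical placeholders rather than values taken from this page. It also assumes the 1.34x figure is a ratio of MFU values measured on the same setup.

```python
# A minimal sketch of Model FLOPs Utilization (MFU) arithmetic.
# Assumptions (not from the source): training cost ~ 6 * N FLOPs per token,
# and the example model size and per-GPU peak below are placeholders.

def model_flops_per_token(num_params: float) -> float:
    """Approximate training FLOPs per token (forward + backward), ignoring attention."""
    return 6.0 * num_params


def mfu(tokens_per_s: float, num_params: float,
        num_gpus: int, peak_flops_per_gpu: float) -> float:
    """MFU = model FLOPs actually delivered per second / aggregate hardware peak."""
    achieved = tokens_per_s * model_flops_per_token(num_params)
    return achieved / (num_gpus * peak_flops_per_gpu)


def tokens_per_s_for_mfu(target_mfu: float, num_params: float,
                         num_gpus: int, peak_flops_per_gpu: float) -> float:
    """Invert the MFU formula: throughput needed to reach a given MFU."""
    return (target_mfu * num_gpus * peak_flops_per_gpu
            / model_flops_per_token(num_params))


def implied_baseline_mfu(reported_mfu: float, improvement_factor: float) -> float:
    """If MFU improved by `improvement_factor`, the baseline MFU is the quotient."""
    return reported_mfu / improvement_factor


if __name__ == "__main__":
    # Figures quoted in the Findings: 55.2% MFU, 1.34x over the baseline.
    print(f"{implied_baseline_mfu(0.552, 1.34):.1%}")  # 41.2%

    # Hypothetical setup: placeholder model size and per-GPU peak (FLOP/s).
    params, gpus, peak = 175e9, 12_288, 312e12
    needed = tokens_per_s_for_mfu(0.552, params, gpus, peak)
    print(f"tokens/s needed under these assumptions: {needed:,.0f}")
```

Under that reading, a 1.34x improvement to 55.2% implies a baseline of roughly 41.2% MFU. The throughput figure printed by the second part depends entirely on the placeholder inputs and should not be read as a reported result.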

Abstract

We present the design, implementation and engineering experience in building and deploying MegaScale, a production system for training large language models (LLMs) at the scale of more than 10,000 GPUs. Training LLMs at this scale brings unprecedented challenges to training efficiency and stability. We take a full-stack approach that co-designs the algorithmic and system components across model block and optimizer design, computation and communication overlapping, operator optimization, data pipeline, and network performance tuning. Maintaining high efficiency throughout the training process (i.e., stability) is an important consideration in production given the long extent of LLM training jobs. Many hard stability issues only emerge at large scale, and in-depth observability is the key to address them. We develop a set of diagnosis tools to monitor system components and events deep in…

Code & Models

Repositories

volcengine/vescale (PyTorch, official)

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques

Methods: Sparse Evolutionary Training