MegaScale: Scaling Large Language Model Training to More Than 10,000 GPUs
Ziheng Jiang, Haibin Lin, Yinmin Zhong, Qi Huang, Yangrui Chen, Zhi Zhang, Yanghua Peng, Xiang Li, Cong Xie, Shibiao Nong, Yulu Jia, Sun He, Hongmin Chen, Zhihao Bai, Qi Hou, Shipeng Yan, Ding Zhou, Yiyao Sheng, Zhuo Jiang, Haohan Xu, Haoran Wei, Zhang Zhang, Pengfei Nie

TL;DR
MegaScale demonstrates a full-stack system design for training large language models on over 10,000 GPUs, achieving high efficiency and stability through innovative system and algorithm co-design, diagnostics, and fault tolerance.
Contribution
This work introduces MegaScale, a production system for training LLMs efficiently and stably on more than 10,000 GPUs, together with diagnosis tools and full-stack optimization techniques.
Findings
Achieved 55.2% Model FLOPs Utilization (MFU) when training a 175B-parameter LLM on 12,288 GPUs.
Improved MFU by 1.34x over Megatron-LM in the same setting.
Developed diagnostic tools for stability and fault tolerance.
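The MFU figures above can be related to raw throughput with a short calculation. The sketch below is illustrative only and is not taken from the paper: it uses the common approximation of 6 FLOPs per parameter per token for forward plus backward passes, which ignores attention-specific terms. The 175B parameter count comes from the paper's full abstract, and the peak of 312 TFLOP/s per GPU is an assumption (the dense BF16 peak of an NVIDIA A100). All function names are hypothetical.

```python
def model_flops_per_token(n_params: float) -> float:
    """Approximate training FLOPs per token (forward + backward).

    Uses the common 6 * N rule of thumb; attention FLOPs are ignored.
    """
    return 6.0 * n_params


def mfu(tokens_per_second: float, n_params: float,
        n_gpus: int, peak_flops_per_gpu: float) -> float:
    """Model FLOPs Utilization: achieved model FLOP/s over aggregate peak FLOP/s."""
    achieved = tokens_per_second * model_flops_per_token(n_params)
    return achieved / (n_gpus * peak_flops_per_gpu)


def tokens_per_second_for_mfu(target_mfu: float, n_params: float,
                              n_gpus: int, peak_flops_per_gpu: float) -> float:
    """Inverse of mfu(): cluster-wide throughput implied by a given MFU."""
    aggregate_peak = n_gpus * peak_flops_per_gpu
    return target_mfu * aggregate_peak / model_flops_per_token(n_params)


def baseline_mfu(improved_mfu: float, speedup: float) -> float:
    """Baseline MFU implied by an improved MFU and a relative speedup."""
    return improved_mfu / speedup


if __name__ == "__main__":
    N_PARAMS = 175e9      # from the paper's full abstract
    N_GPUS = 12_288       # from the findings above
    PEAK = 312e12         # ASSUMPTION: A100 dense BF16 peak, FLOP/s

    tps = tokens_per_second_for_mfu(0.552, N_PARAMS, N_GPUS, PEAK)
    print(f"implied throughput: {tps:,.0f} tokens/s")
    print(f"round-trip MFU: {mfu(tps, N_PARAMS, N_GPUS, PEAK):.3f}")
    print(f"implied baseline MFU: {baseline_mfu(0.552, 1.34):.3f}")
```

Under these assumptions, a 1.34x improvement to 55.2% MFU implies a baseline of about 41.2%. The implied throughput depends entirely on the assumed per-GPU peak and the 6N approximation, so it should be read as an order-of-magnitude estimate.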
Abstract
We present the design, implementation and engineering experience in building and deploying MegaScale, a production system for training large language models (LLMs) at the scale of more than 10,000 GPUs. Training LLMs at this scale brings unprecedented challenges to training efficiency and stability. We take a full-stack approach that co-designs the algorithmic and system components across model block and optimizer design, computation and communication overlapping, operator optimization, data pipeline, and network performance tuning. Maintaining high efficiency throughout the training process (i.e., stability) is an important consideration in production given the long extent of LLM training jobs. Many hard stability issues only emerge at large scale, and in-depth observability is the key to address them. We develop a set of diagnosis tools to monitor system components and events deep in…
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques
Methods: Sparse Evolutionary Training
