A Scalable Recipe on SuperMUC-NG Phase 2: Efficient Large-Scale Training of Language Models
Ajay Navilarekal Rajgopal, Nikolai Solmsdorf

TL;DR
This paper presents an efficient, scalable recipe for training large language models of up to 175 billion parameters on the SuperMUC-NG Phase 2 supercomputer, combining several parallelism techniques to reach high throughput.
Contribution
It introduces a comprehensive, accessible training strategy combining tensor, pipeline, and sharded data parallelism for large-scale LLMs on HPC systems.
Findings
Achieved 10% of theoretical peak FLOPS per GPU tile during training.
Demonstrated 93% weak scaling efficiency on 128 nodes.
Maintained 82% strong scaling efficiency across the system.
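The three findings above are ratios, and a short sketch may make the terms concrete. The definitions below are the conventional ones for weak scaling, strong scaling, and fraction of peak; the summary does not state the exact formulas the authors used, so treat them as an assumption. All function names and input numbers are hypothetical and were chosen only to reproduce the reported percentages, not taken from the paper's measurements.

```python
def weak_scaling_efficiency(base_throughput, base_nodes, scaled_throughput, scaled_nodes):
    """Weak scaling: work per node is fixed, so ideal throughput grows linearly with node count."""
    ideal_throughput = base_throughput * (scaled_nodes / base_nodes)
    return scaled_throughput / ideal_throughput


def strong_scaling_efficiency(base_time, base_nodes, scaled_time, scaled_nodes):
    """Strong scaling: total work is fixed, so ideal runtime shrinks linearly with node count."""
    ideal_time = base_time * (base_nodes / scaled_nodes)
    return ideal_time / scaled_time


def fraction_of_peak(achieved_flops_per_s, peak_flops_per_s):
    """Share of the hardware's theoretical peak rate that training actually sustains."""
    return achieved_flops_per_s / peak_flops_per_s


# Hypothetical inputs (NOT measurements from the paper).
weak = weak_scaling_efficiency(base_throughput=1000.0, base_nodes=1,
                               scaled_throughput=119040.0, scaled_nodes=128)
strong = strong_scaling_efficiency(base_time=82.0, base_nodes=1,
                                   scaled_time=10.0, scaled_nodes=10)
peak_share = fraction_of_peak(achieved_flops_per_s=10.0, peak_flops_per_s=100.0)

print(f"weak scaling efficiency:   {weak:.2f}")
print(f"strong scaling efficiency: {strong:.2f}")
print(f"fraction of peak:          {peak_share:.2f}")
```

Under these definitions, a weak scaling efficiency of 93% on 128 nodes means the aggregate throughput is 93% of what 128 perfectly independent copies of the baseline would deliver.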
Abstract
Large Language Models (LLMs) continue to demonstrate superior performance with increasing scale, yet training models with billions to trillions of parameters requires staggering computational resources; for example, a one-trillion-parameter GPT-style model requires an estimated 120 million exaflops. This challenge necessitates efficient distributed training strategies on cutting-edge High-Performance Computing (HPC) infrastructure. In this work, we explore the SuperMUC-NG Phase 2 (SMNG-P2) system at the Leibniz Supercomputing Centre (LRZ) in Garching, Germany, which is equipped with Intel Data Center GPU Max 1550 accelerators, to extract the necessary computational power. We enable and investigate a comprehensive recipe of parallel training techniques, including tensor parallelism, pipeline parallelism, and sharded data parallelism, essential for facilitating the training of LLMs up to 175…
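As a rough sanity check on the 120-million-exaflop figure quoted in the abstract, the sketch below applies the common rule of thumb that training a dense transformer costs about 6 floating-point operations per parameter per token. Two things here are assumptions rather than statements from the abstract: the reading of "exaflops" as a total operation count in units of 10^18, and the token budget of 20 tokens per parameter. The function name is hypothetical.

```python
def training_flops(n_params, n_tokens):
    """Approximate total training cost: about 6 floating-point operations per parameter per token."""
    return 6.0 * n_params * n_tokens


n_params = 1e12                   # one-trillion-parameter model, as in the abstract
tokens_per_param = 20             # ASSUMPTION: compute-optimal style token budget
n_tokens = tokens_per_param * n_params

total_flops = training_flops(n_params, n_tokens)
total_exaflop = total_flops / 1e18    # express in units of 10^18 operations

print(f"estimated training cost: {total_exaflop / 1e6:.0f} million exaFLOP")
```

With these assumed inputs the estimate lands on the same order as the figure in the abstract; a different token budget would scale the result proportionally.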
