Optimizing Distributed Training on Frontier for Large Language Models
Sajal Dash, Isaac Lyngaas, Junqi Yin, Xiao Wang, Romain Egele, Guojing Cong, Feiyi Wang, Prasanna Balaprakash

TL;DR
This paper investigates efficient distributed training strategies for large language models on the Frontier supercomputer, tuning tensor, pipeline, and sharded data parallelism to improve throughput and scaling efficiency.
Contribution
The paper presents a comprehensive empirical analysis of combined parallelism techniques, together with tuned hyperparameters, for training models of up to one trillion parameters on exascale hardware.
Findings
Achieved GPU throughputs of 38.38%, 36.14%, and 31.96% of peak for the 22B, 175B, and 1T models, respectively.
Attained 100% weak scaling efficiency on 1024 and 3072 GPUs for the 175B and 1T models, respectively.
Secured strong scaling efficiencies of 89% and 87% for the 175B and 1T models, respectively.
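For readers unfamiliar with the two scaling metrics above, the sketch below uses their standard textbook definitions. The function names and the timings in the usage lines are hypothetical illustrations, not measurements from the paper.

```python
def weak_scaling_efficiency(base_time: float, scaled_time: float) -> float:
    """Work per GPU is held fixed while the GPU count grows.

    Ideal behavior is an unchanged step time, so efficiency is
    base_time / scaled_time (1.0 means 100%).
    """
    return base_time / scaled_time


def strong_scaling_efficiency(base_gpus: int, base_time: float,
                              scaled_gpus: int, scaled_time: float) -> float:
    """Total work is held fixed while the GPU count grows.

    Ideal behavior is a speedup equal to the GPU ratio, so efficiency is
    the measured speedup divided by that ideal speedup.
    """
    speedup = base_time / scaled_time
    ideal_speedup = scaled_gpus / base_gpus
    return speedup / ideal_speedup


# Hypothetical timings, chosen only to illustrate the formulas.
weak = weak_scaling_efficiency(base_time=10.0, scaled_time=10.0)
strong = strong_scaling_efficiency(base_gpus=1024, base_time=100.0,
                                   scaled_gpus=2048, scaled_time=56.0)
print(f"weak: {weak:.0%}, strong: {strong:.0%}")
```

Under these definitions, 100% weak scaling means the per-step time did not grow as GPUs were added along with the workload, while a strong scaling efficiency below 100% means that doubling the GPUs on a fixed workload gave less than a 2x speedup.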
Abstract
Large language models (LLMs) have demonstrated remarkable success as foundational models, benefiting various downstream applications through fine-tuning. Recent studies on loss scaling have demonstrated the superior performance of larger LLMs compared to their smaller counterparts. Nevertheless, training LLMs with billions of parameters poses significant challenges and requires considerable computational resources. For example, training a one trillion parameter GPT-style model on 20 trillion tokens requires a staggering 120 million exaflops of computation. This research explores efficient distributed training strategies to extract this computation from Frontier, the world's first exascale supercomputer dedicated to open science. We enable and investigate various model and data parallel training techniques, such as tensor parallelism, pipeline parallelism, and sharded data parallelism,…
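The 120 million exaflops figure in the abstract can be reproduced with a short calculation. The sketch below assumes the common rule of thumb of roughly 6 floating-point operations per parameter per training token (forward plus backward pass); that factor is an assumption made here, not something the abstract states, and "exaflops" is read as a count of 10^18 operations rather than a rate. The function name is hypothetical.

```python
EXA = 1e18  # one exaFLOP, read here as 10^18 floating-point operations


def training_flops(n_params: float, n_tokens: float,
                   flops_per_param_token: float = 6.0) -> float:
    """Estimate total training compute.

    Assumes about 6 FLOPs per parameter per token, a widely used
    approximation for dense transformer training; this factor is an
    assumption, not a value taken from the abstract.
    """
    return flops_per_param_token * n_params * n_tokens


total = training_flops(n_params=1e12, n_tokens=20e12)  # 1T params, 20T tokens
exaflop = total / EXA
print(f"{total:.2e} FLOPs = {exaflop / 1e6:.0f} million exaFLOP")
```

With one trillion parameters and 20 trillion tokens, the estimate is 6 x 10^12 x 2 x 10^13 = 1.2 x 10^26 operations, which matches the 120 million exaflops quoted in the abstract.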
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Speech Recognition and Synthesis
