Optimizing Distributed Training on Frontier for Large Language Models

Sajal Dash; Isaac Lyngaas; Junqi Yin; Xiao Wang; Romain Egele; Guojing Cong; Feiyi Wang; Prasanna Balaprakash

arXiv:2312.12705 · cs.DC · December 25, 2023 · 1 citation

TL;DR

This paper investigates efficient distributed training strategies for large language models on the Frontier supercomputer, tuning tensor, pipeline, and sharded data parallelism to improve throughput and scaling efficiency.

Contribution

The paper presents an empirical analysis of combined parallelism techniques, with tuned hyperparameters, for training models of up to one trillion parameters on exascale hardware.

Findings

01

Achieved GPU throughputs of 38.38%, 36.14%, and 31.96% of peak for the 22B, 175B, and 1T models, respectively.

02

Attained 100% weak scaling efficiency on 1024 and 3072 MI250X GPUs for the 175B and 1T models, respectively.

03

Achieved strong scaling efficiencies of 89% and 87% for the 175B and 1T models, respectively.
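
For readers unfamiliar with the two scaling metrics in the findings, the sketch below uses their conventional definitions. The page does not spell out how the paper computes them, so the formulas are an assumption, and the function names and timings are hypothetical, not figures from the paper.

```python
def weak_scaling_efficiency(base_step_time: float, scaled_step_time: float) -> float:
    """Weak scaling: GPUs and total work grow together, so the ideal
    per-step time stays constant. Efficiency is base time over scaled time."""
    return base_step_time / scaled_step_time


def strong_scaling_efficiency(base_gpus: int, base_step_time: float,
                              gpus: int, step_time: float) -> float:
    """Strong scaling: total work is fixed while GPUs increase.
    Efficiency is the observed speedup divided by the ideal speedup."""
    observed_speedup = base_step_time / step_time
    ideal_speedup = gpus / base_gpus
    return observed_speedup / ideal_speedup


if __name__ == "__main__":
    # Hypothetical timings in seconds per step, chosen only for illustration.
    print(f"weak:   {weak_scaling_efficiency(10.0, 10.0):.2%}")
    print(f"strong: {strong_scaling_efficiency(1024, 10.0, 2048, 6.25):.2%}")
```

Under these definitions, 100% weak scaling means the per-step time did not grow as GPUs and work were increased together, and a strong scaling efficiency below 100% means the speedup fell short of the increase in GPU count.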

Abstract

Large language models (LLMs) have demonstrated remarkable success as foundational models, benefiting various downstream applications through fine-tuning. Recent studies on loss scaling have demonstrated the superior performance of larger LLMs compared to their smaller counterparts. Nevertheless, training LLMs with billions of parameters poses significant challenges and requires considerable computational resources. For example, training a one trillion parameter GPT-style model on 20 trillion tokens requires a staggering 120 million exaflops of computation. This research explores efficient distributed training strategies to extract this computation from Frontier, the world's first exascale supercomputer dedicated to open science. We enable and investigate various model and data parallel training techniques, such as tensor parallelism, pipeline parallelism, and sharded data parallelism,…
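
The 120 million exaflops figure in the abstract can be sanity-checked with a short calculation. The sketch below assumes the widely used rule of thumb of about 6 floating-point operations per parameter per training token. The excerpt above does not state which formula the authors used, so the constant and the function name are assumptions. Here "exaflop" means 10^18 operations (a count, not a rate).

```python
EXAFLOP = 1e18  # 10^18 floating-point operations


def training_flops(n_params: float, n_tokens: float,
                   flops_per_param_token: float = 6.0) -> float:
    """Rough total training compute: about 6 FLOPs per parameter per token
    (an assumed rule of thumb covering forward and backward passes)."""
    return flops_per_param_token * n_params * n_tokens


if __name__ == "__main__":
    total = training_flops(n_params=1e12, n_tokens=20e12)
    print(f"total FLOPs:      {total:.3e}")
    print(f"million exaflops: {total / EXAFLOP / 1e6:.1f}")
```

With one trillion parameters and 20 trillion tokens, the estimate comes to about 1.2 x 10^26 operations, which matches the 120 million exaflops quoted in the abstract.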

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Speech Recognition and Synthesis