Scaling Large Language Model Training on Frontier with Low-Bandwidth Partitioning

Lang Xu; Quentin Anthony; Jacob Hatef; Aamir Shafi; Hari Subramoni; Dhabaleswar K. (DK) Panda

arXiv:2501.04266·cs.DC·February 5, 2025

TL;DR

This paper introduces a 3-level hierarchical partitioning strategy for ZeRO++ that targets large language model training on Frontier, cutting communication costs and raising per-GPU throughput (a 1.71x increase in TFLOPS per GPU for a 20B GPT model).

Contribution

It proposes a novel 3-level hierarchical partitioning scheme, tailored to Frontier's hardware topology, that improves the training efficiency of large models.

Findings

01 Achieved a 1.71x increase in TFLOPS per GPU for a 20B GPT model.

02 Demonstrated a scaling efficiency of 0.94 at up to 384 GCDs.

03 Reduced communication overhead through targeted hierarchical partitioning.

Abstract

Scaling up Large Language Model (LLM) training involves fitting a tremendous number of training parameters across a limited number of workers. However, methods like ZeRO-3 that drastically reduce GPU memory pressure often incur heavy communication to ensure global synchronization and consistency. Established efforts such as ZeRO++ use secondary partitions to avoid inter-node communication, given that intra-node GPU-GPU transfers generally have more bandwidth and lower latency than inter-node connections. However, as more capable infrastructure like Frontier, equipped with AMD GPUs, has emerged with impressive computing capability, there is a need to investigate the hardware topology and to develop targeted strategies to improve training efficiency. In this work, we propose a collection of communication and optimization strategies for ZeRO++ to reduce communication costs and improve…
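
The idea of nested partitions can be illustrated with a small sketch that groups ranks by locality. This is not the authors' implementation. It assumes the three levels are the two GCDs of one MI250X, the eight GCDs of one Frontier node, and all GCDs in the job, and it assumes ranks are assigned contiguously within each GPU and node. The function name and the level names ("gcd_pair", "node", "global") are hypothetical.

```python
from collections import defaultdict

GCDS_PER_GPU = 2   # each MI250X exposes two GCDs
GCDS_PER_NODE = 8  # a Frontier node has four MI250X, so eight GCDs


def hierarchy_groups(world_size):
    """Group ranks into three nested levels under a contiguous rank layout.

    Hypothetical helper for illustration only: the level names and the
    contiguous rank-to-device mapping are assumptions, not taken from the paper.
    """
    if world_size <= 0 or world_size % GCDS_PER_NODE != 0:
        raise ValueError("world_size must be a positive multiple of 8")

    gcd_pairs = defaultdict(list)
    nodes = defaultdict(list)
    for rank in range(world_size):
        gcd_pairs[rank // GCDS_PER_GPU].append(rank)  # same MI250X package
        nodes[rank // GCDS_PER_NODE].append(rank)     # same node

    return {
        "gcd_pair": list(gcd_pairs.values()),
        "node": list(nodes.values()),
        "global": [list(range(world_size))],
    }


if __name__ == "__main__":
    groups = hierarchy_groups(16)
    for level, members in groups.items():
        print(level, members)
```

In a scheme of this shape, a collective restricted to a smaller group stays on faster links, while only the "global" group crosses the inter-node network. Which data is partitioned at which level is a design decision of the paper that this sketch does not attempt to reproduce.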

Taxonomy

Topics: Natural Language Processing Techniques · Text and Document Classification Technologies · Speech Recognition and Synthesis

Methods: Attention Is All You Need · Linear Layer · Multi-Head Attention · Discriminative Fine-Tuning · Layer Normalization · Dense Connections · Attention Dropout · Cosine Annealing · Softmax