Scaling Large Language Model Training on Frontier with Low-Bandwidth Partitioning
Lang Xu, Quentin Anthony, Jacob Hatef, Aamir Shafi, Hari Subramoni, Dhabaleswar K. (DK) Panda

TL;DR
This paper introduces a hierarchical partitioning strategy for ZeRO++ to optimize large language model training on Frontier, significantly reducing communication costs and boosting GPU performance.
Contribution
The paper proposes a 3-level hierarchical partitioning scheme, tailored to Frontier's hardware topology, to improve the training efficiency of large models.
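To make the idea of topology-aware grouping concrete, here is a minimal sketch of how ranks might be bucketed into three nested levels. It is not the authors' implementation. It assumes the three levels correspond to the two GCDs on one MI250X package, the eight GCDs in one Frontier node, and all GCDs in the job, and that ranks are numbered contiguously within a node. The function name `hierarchical_groups` and its parameters are hypothetical.

```python
from typing import Dict, List


def hierarchical_groups(
    world_size: int,
    gcds_per_gpu: int = 2,   # one MI250X package exposes 2 GCDs
    gcds_per_node: int = 8,  # a Frontier node holds 4 MI250X packages
) -> Dict[str, List[List[int]]]:
    """Split ranks 0..world_size-1 into three nested levels of groups.

    Assumption (illustrative only): consecutive ranks share a GPU
    package first and a node second.
    """
    if gcds_per_node % gcds_per_gpu != 0:
        raise ValueError("gcds_per_node must be a multiple of gcds_per_gpu")
    if world_size % gcds_per_node != 0:
        raise ValueError("world_size must be a multiple of gcds_per_node")

    ranks = list(range(world_size))

    def chunk(size: int) -> List[List[int]]:
        # Cut the rank list into consecutive groups of the given size.
        return [ranks[i:i + size] for i in range(0, world_size, size)]

    return {
        "gpu": chunk(gcds_per_gpu),    # level 1: GCD pairs on one package
        "node": chunk(gcds_per_node),  # level 2: all GCDs in one node
        "global": [ranks],             # level 3: every GCD in the job
    }
```

With 384 GCDs this yields 192 package-level groups, 48 node-level groups, and one global group. In a real training job each rank list would typically be turned into a communicator (for example with `torch.distributed.new_group`) so that the most frequent collectives run over the fastest links.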
Findings
Achieved a 1.71x increase in TFLOPS per GPU for a 20B GPT model.
Demonstrated a scaling efficiency of 0.94 up to 384 GCDs.
Reduced communication overhead through targeted hierarchical partitioning.
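The summary does not say how the scaling efficiency figure is defined. A common convention, assumed here, is the ratio of measured aggregate throughput to the throughput that perfect linear scaling from a baseline run would give. The helper below is a sketch under that assumption; the name `scaling_efficiency` and its arguments are hypothetical, and the numbers in the usage note are made up for illustration, not taken from the paper.

```python
def scaling_efficiency(
    base_workers: int,
    base_throughput: float,
    workers: int,
    throughput: float,
) -> float:
    """Measured throughput divided by ideal linearly scaled throughput.

    Throughput can be in any consistent aggregate unit, such as total
    TFLOPS or samples per second across all workers.
    """
    if base_workers <= 0 or workers <= 0 or base_throughput <= 0:
        raise ValueError("worker counts and base throughput must be positive")
    # Ideal case: throughput grows in proportion to the worker count.
    ideal = base_throughput * workers / base_workers
    return throughput / ideal
```

For example, if a hypothetical baseline of 8 workers delivers 100 units of throughput and 16 workers deliver 188, the function returns about 0.94, meaning the larger run retains 94% of ideal linear scaling.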
Abstract
Scaling up Large Language Model (LLM) training involves fitting a tremendous number of training parameters across a limited number of workers. However, methods like ZeRO-3 that drastically reduce GPU memory pressure often incur heavy communication to ensure global synchronization and consistency. Established efforts such as ZeRO++ use secondary partitions to avoid inter-node communication, given that intra-node GPU-GPU transfer generally has more bandwidth and lower latency than inter-node connections. However, as more capable infrastructure such as Frontier, equipped with AMD GPUs, has emerged with impressive computing capability, there is a need to investigate the hardware topology and develop targeted strategies to improve training efficiency. In this work, we propose a collection of communication and optimization strategies for ZeRO++ to reduce communication costs and improve…
Taxonomy
Topics: Natural Language Processing Techniques · Text and Document Classification Technologies · Speech Recognition and Synthesis
Methods: Attention Is All You Need · Linear Layer · Multi-Head Attention · Discriminative Fine-Tuning · Layer Normalization · Dense Connections · Attention Dropout · Cosine Annealing · Softmax
