Democratizing AI: Open-source Scalable LLM Training on GPU-based Supercomputers
Siddharth Singh, Prajwal Singhania, Aditya Ranjan, John Kirchenbauer, Jonas Geiping, Yuxin Wen, Neel Jain, Abhimanyu Hans, Manli Shu, Aditya Tomar, Tom Goldstein, Abhinav Bhatele

TL;DR
This paper introduces AxoNN, an open-source framework enabling scalable training of large language models on supercomputers, achieving unprecedented performance and addressing privacy risks like data memorization.
Contribution
AxoNN provides a novel four-dimensional hybrid parallel algorithm and performance optimizations for scalable LLM training on GPU supercomputers, with demonstrated high efficiency and privacy safeguards.
Findings
Achieved unprecedented scaling and peak bf16 flop/s for training GPT-style transformers on Perlmutter (620.1 Petaflop/s), Frontier (1.381 Exaflop/s), and Alps (1.423 Exaflop/s).
Demonstrated fine-tuning of a 405-billion-parameter model.
Explored and mitigated catastrophic memorization risks in large models.
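Why a 405-billion-parameter model needs many GPUs can be seen with a back-of-the-envelope memory estimate. The sketch below assumes the common mixed-precision Adam accounting of 16 bytes per parameter (bf16 weights and gradients plus fp32 master weights and two fp32 Adam moments) and ignores activations entirely; these byte counts, the function names, and the per-GPU capacities are illustrative assumptions, not figures from the paper.

```python
import math

# Assumed mixed-precision Adam accounting (illustrative, not from the paper).
BYTES_PER_PARAM = {
    "bf16 weights": 2,
    "bf16 gradients": 2,
    "fp32 master weights": 4,
    "fp32 Adam first moment": 4,
    "fp32 Adam second moment": 4,
}


def model_state_bytes(num_params: float) -> float:
    """Bytes for weights, gradients, and optimizer state (activations excluded)."""
    return num_params * sum(BYTES_PER_PARAM.values())


def min_gpus(num_params: float, gpu_mem_gb: float) -> int:
    """Hypothetical lower bound on GPU count if model state were perfectly sharded."""
    total_gb = model_state_bytes(num_params) / 1e9
    return math.ceil(total_gb / gpu_mem_gb)


if __name__ == "__main__":
    n = 405e9
    print(f"model state: {model_state_bytes(n) / 1e12:.2f} TB")
    for mem in (64, 80):  # illustrative per-GPU memory capacities in GB
        print(f"{mem} GB GPUs: at least {min_gpus(n, mem)}")
```

Under these assumptions the model state alone is 6.48 TB, so `min_gpus(405e9, 80)` returns 81. The real requirement is higher once activations, communication buffers, and imperfect sharding are included, which is why a scalable parallel algorithm matters even for fine-tuning.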
Abstract
Training and fine-tuning large language models (LLMs) with hundreds of billions to trillions of parameters requires tens of thousands of GPUs and a highly scalable software stack. In this work, we present a novel four-dimensional hybrid parallel algorithm implemented in a highly scalable, portable, open-source framework called AxoNN. We describe several performance optimizations in AxoNN: improving matrix multiply kernel performance, overlapping non-blocking collectives with computation, and using performance modeling to choose performance-optimal configurations. These have resulted in unprecedented scaling and peak flop/s (bf16) for training GPT-style transformer models on Perlmutter (620.1 Petaflop/s), Frontier (1.381 Exaflop/s), and Alps (1.423 Exaflop/s). While the abilities of LLMs improve with the number of trainable parameters, so do privacy and copyright risks caused by memorization…
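The abstract mentions a four-dimensional hybrid parallel algorithm and performance modeling to pick a configuration. A minimal sketch of that selection step follows, assuming the four dimensions are a three-dimensional tensor-parallel grid (g_x, g_y, g_z) plus data parallelism (g_data) whose product equals the GPU count. The cost function is a hypothetical toy that counts per-GPU ring-collective volume for one layer Y = X @ W; it is not AxoNN's API or its published performance model, and all names here are illustrative.

```python
from typing import Iterator, Tuple

Config = Tuple[int, int, int, int]  # hypothetical layout: (g_x, g_y, g_z, g_data)


def divisors(n: int) -> list:
    """All positive divisors of n in increasing order."""
    return [d for d in range(1, n + 1) if n % d == 0]


def four_d_configs(num_gpus: int) -> Iterator[Config]:
    """Yield every (g_x, g_y, g_z, g_data) whose product equals num_gpus."""
    for gx in divisors(num_gpus):
        for gy in divisors(num_gpus // gx):
            for gz in divisors(num_gpus // (gx * gy)):
                yield gx, gy, gz, num_gpus // (gx * gy * gz)


def toy_comm_volume(cfg: Config, m: int, k: int, n: int) -> float:
    """Toy per-GPU communication volume (elements) for Y = X @ W.

    X is m x k and W is k x n. Ring collectives are assumed: an all-reduce
    of N elements moves 2(p-1)/p * N per rank; all-gather and reduce-scatter
    move (p-1)/p * N. This is NOT AxoNN's published performance model.
    """
    gx, gy, gz, gd = cfg
    w_block = k * n / (gx * gy)  # weight block owned by one (x, y) grid cell
    rows = m / (gz * gd)         # input rows processed by one GPU
    fwd_weight_gather = w_block * (gz - 1) / gz
    fwd_output_allreduce = 2 * (gx - 1) / gx * rows * (n / gy)  # partial sums over k
    bwd_input_allreduce = 2 * (gy - 1) / gy * rows * (k / gx)   # partial sums over n
    bwd_grad_reduce_scatter = w_block * (gz - 1) / gz
    dp_grad_allreduce = 2 * (gd - 1) / gd * w_block / gz
    return (fwd_weight_gather + fwd_output_allreduce + bwd_input_allreduce
            + bwd_grad_reduce_scatter + dp_grad_allreduce)


def best_config(num_gpus: int, m: int, k: int, n: int) -> Config:
    """Exhaustively pick the configuration with the lowest toy volume."""
    return min(four_d_configs(num_gpus),
               key=lambda cfg: toy_comm_volume(cfg, m, k, n))


if __name__ == "__main__":
    # Large weights, tiny batch: tensor parallelism should win.
    print(best_config(8, m=1, k=10_000, n=10_000))
    # Tiny weights, huge batch: the x/y tensor dimensions should stay at 1.
    print(best_config(8, m=1_000_000, k=4, n=4))
```

Exhaustive enumeration is cheap because the number of ordered factorizations of a GPU count grows slowly (8 GPUs give 20 configurations). A realistic model would also weigh memory limits, network topology, and how much communication can be overlapped with computation.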
Taxonomy
Topics: Parallel Computing and Optimization Techniques · Natural Language Processing Techniques
