A 4D Hybrid Algorithm to Scale Parallel Training to Thousands of GPUs

Siddharth Singh; Prajwal Singhania; Aditya K. Ranjan; Zack Sating; Abhinav Bhatele

arXiv:2305.13525 · cs.LG · May 15, 2024 · 1 citation

TL;DR

This paper presents a 4D hybrid parallel algorithm, combining 3D tensor parallelism with data parallelism, that reduces communication overhead when training large neural networks on thousands of GPUs.

Contribution

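The paper introduces a 4D hybrid parallelism approach (3D tensor parallelism plus data parallelism) implemented in the AxoNN framework, overlaps collective operations (reduce-scatter, all-gather, and all-reduce) with computation, and provides an analytical model for identifying high-performing configurations.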

Findings

01

AxoNN outperforms Megatron-LM by 26% when training an 80-billion-parameter GPT model.

02

AxoNN achieves 57% of the theoretical peak FLOP/s, which corresponds to 182 PFLOP/s in total.

03

Overlapping collective operations with computation, together with model-guided configuration selection, reduces communication bottlenecks in large-scale GPU training.
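As a quick sanity check on the second finding, the two reported numbers imply an aggregate theoretical peak for the GPUs used. The sketch below only divides the rounded figures quoted above, so the result is approximate; the function name `implied_peak_pflops` is a hypothetical helper for illustration and does not come from the paper or from AxoNN.

```python
def implied_peak_pflops(achieved_pflops: float, fraction_of_peak: float) -> float:
    """Aggregate theoretical peak implied by an achieved rate and a peak fraction."""
    if not 0.0 < fraction_of_peak <= 1.0:
        raise ValueError("fraction_of_peak must be in (0, 1]")
    return achieved_pflops / fraction_of_peak


# Figures as quoted in the findings above (both are rounded values).
peak = implied_peak_pflops(achieved_pflops=182.0, fraction_of_peak=0.57)
print(f"Implied aggregate peak: about {peak:.0f} PFLOP/s")  # about 319 PFLOP/s
```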

Abstract

Heavy communication, in particular, collective operations, can become a critical performance bottleneck in scaling the training of billion-parameter neural networks to large-scale parallel systems. This paper introduces a four-dimensional (4D) approach to optimize communication in parallel training. This 4D approach is a hybrid of 3D tensor and data parallelism, and is implemented in the AxoNN framework. In addition, we employ two key strategies to further minimize communication overheads. First, we aggressively overlap expensive collective operations (reduce-scatter, all-gather, and all-reduce) with computation. Second, we develop an analytical model to identify high-performing configurations within the large search space defined by our 4D algorithm. This model empowers practitioners by simplifying the tuning process for their specific training workloads. When training an 80-billion…
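To give a feel for the size of the search space the abstract mentions, the following minimal sketch enumerates every way to split a fixed GPU count across four dimensions: three tensor-parallel dimensions and one data-parallel dimension. The names `enumerate_4d_configs`, `g_x`, `g_y`, `g_z`, and `g_data` are assumptions chosen for illustration. The sketch does not reproduce the paper's analytical model, which is what would rank these candidates; it only lists them.

```python
from typing import List, Tuple


def divisors(n: int) -> List[int]:
    """Return all positive divisors of n in increasing order."""
    return [d for d in range(1, n + 1) if n % d == 0]


def enumerate_4d_configs(num_gpus: int) -> List[Tuple[int, int, int, int]]:
    """List every (g_x, g_y, g_z, g_data) whose product equals num_gpus.

    g_x, g_y, g_z stand for the three tensor-parallel dimensions and
    g_data for the data-parallel dimension (hypothetical names).
    """
    configs = []
    for g_x in divisors(num_gpus):
        for g_y in divisors(num_gpus // g_x):
            for g_z in divisors(num_gpus // (g_x * g_y)):
                # The last dimension is fixed once the first three are chosen.
                g_data = num_gpus // (g_x * g_y * g_z)
                configs.append((g_x, g_y, g_z, g_data))
    return configs


if __name__ == "__main__":
    candidates = enumerate_4d_configs(1024)
    print(len(candidates))  # 286 candidate decompositions for 1024 GPUs
```

Even at 1024 GPUs there are hundreds of candidate decompositions, which suggests why a model that predicts good configurations is more practical than benchmarking each one.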

Code & Models

Repositories

axonn-ai/axonn
pytorch

Taxonomy

Topics: Parallel Computing and Optimization Techniques · Stochastic Gradient Optimization Techniques · Tensor decomposition and applications

Methods: Attention Is All You Need · Linear Layer · Discriminative Fine-Tuning · Multi-Head Attention · Layer Normalization · Dense Connections · Attention Dropout · Weight Decay · Cosine Annealing