Accelerating Large Language Model Training with 4D Parallelism and Memory Consumption Estimator

Kazuki Fujii; Kohei Watanabe; Rio Yokota

arXiv:2411.06465 · cs.LG · November 12, 2024

TL;DR

This paper introduces precise formulas for estimating memory usage in 4D parallelism for large language model training, enabling efficient configuration selection and reducing memory overflow risks.

Contribution

It provides the first detailed memory consumption formulas for 4D parallelism in LLM training, incorporating factors like memory fragmentation and temporary buffers.

Findings

01

In the experiments, training did not hit out-of-memory errors when the estimated memory usage was below 80% of the available GPU memory.

02

The formulas effectively predict memory overflow, guiding optimal parallelization configurations.

03

Empirical analysis reveals insights into optimal 4D parallelism setups.
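
The 80% rule from the first finding can be applied as a simple filter over candidate configurations. The sketch below is illustrative only: the function name `within_memory_budget`, the candidate labels, and the per-GPU estimates are hypothetical and do not come from the paper. It assumes an estimate in bytes is already available from some estimator.

```python
GIB = 1024 ** 3


def within_memory_budget(estimated_bytes, gpu_capacity_bytes, threshold=0.8):
    """Return True if the estimate is below `threshold` of GPU capacity.

    The default of 0.8 mirrors the 80% figure in the findings; the
    function itself is a hypothetical helper, not the authors' code.
    """
    return estimated_bytes < threshold * gpu_capacity_bytes


# Hypothetical per-GPU estimates (in GiB) for three made-up configurations.
candidates = {
    "tp2_pp2_dp4": 60,
    "tp1_pp4_dp4": 70,
    "tp4_pp1_dp4": 50,
}

capacity = 80 * GIB  # e.g. a GPU with 80 GiB of memory

safe = [
    name
    for name, est_gib in candidates.items()
    if within_memory_budget(est_gib * GIB, capacity)
]
print(safe)
```

Only configurations under the 64 GiB budget (80% of 80 GiB) survive the filter; choosing the fastest among them would still need throughput measurements.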

Abstract

In large language model (LLM) training, several parallelization strategies, including Tensor Parallelism (TP), Pipeline Parallelism (PP), Data Parallelism (DP), as well as Sequence Parallelism (SP) and Context Parallelism (CP), are employed to distribute model parameters, activations, and optimizer states across devices. Identifying the optimal parallelization configuration for each environment while avoiding GPU memory overflow remains a challenging task. In this study, we provide precise formulas to estimate the memory consumed by parameters, gradients, optimizer states, and activations for 4D parallel training (DP, TP, PP, CP) in the Llama architecture. We conducted 454 experiments on A100 and H100 GPUs, incorporating often neglected factors such as temporary buffers and memory fragmentation into our analysis. Results indicate that when the estimated memory usage is below 80% of the…
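
To give a feel for what such an estimate involves, here is a deliberately simplified sketch for the model-state part only (parameters, gradients, optimizer states). It is not the paper's formula. It assumes the common mixed-precision Adam accounting of 2 bytes per parameter, 2 bytes per gradient, and 12 bytes of optimizer state per parameter, an even split of parameters across TP and PP ranks, and optional sharding of optimizer states across DP ranks. Activations, temporary buffers, fragmentation, and CP are left out. All names (`ParallelConfig`, `model_state_bytes_per_gpu`) are hypothetical.

```python
from dataclasses import dataclass

GIB = 1024 ** 3


@dataclass
class ParallelConfig:
    tp: int = 1  # tensor-parallel degree
    pp: int = 1  # pipeline-parallel degree
    dp: int = 1  # data-parallel degree
    shard_optimizer_over_dp: bool = False  # assumption: ZeRO-1 style sharding


def model_state_bytes_per_gpu(num_params, cfg,
                              param_bytes=2, grad_bytes=2, optim_bytes=12):
    """Rough per-GPU bytes for parameters, gradients, and optimizer states.

    Simplifying assumptions (not from the paper): parameters divide evenly
    over tp * pp ranks, and activations and buffers are ignored.
    """
    params_per_gpu = num_params / (cfg.tp * cfg.pp)
    optim_shards = cfg.dp if cfg.shard_optimizer_over_dp else 1
    per_param = param_bytes + grad_bytes + optim_bytes / optim_shards
    return params_per_gpu * per_param


cfg = ParallelConfig(tp=2, pp=2, dp=4)
sharded = ParallelConfig(tp=2, pp=2, dp=4, shard_optimizer_over_dp=True)

num_params = 8_000_000_000  # hypothetical 8B-parameter model
print(model_state_bytes_per_gpu(num_params, cfg) / GIB)
print(model_state_bytes_per_gpu(num_params, sharded) / GIB)
```

Even this coarse model shows why the configuration matters: sharding optimizer states over four DP ranks cuts the per-parameter cost from 16 to 7 bytes. A real estimator would add activation memory, which depends on sequence length, micro-batch size, and the CP and TP degrees.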

Taxonomy

Topics: Topic Modeling

Methods: Fragmentation · LLaMA