Accelerating Large Language Model Training with 4D Parallelism and Memory Consumption Estimator
Kazuki Fujii, Kohei Watanabe, Rio Yokota

TL;DR
This paper introduces precise formulas for estimating memory usage in 4D parallelism for large language model training, enabling efficient configuration selection and reducing memory overflow risks.
Contribution
It provides the first detailed memory consumption formulas for 4D parallelism in LLM training, incorporating factors like memory fragmentation and temporary buffers.
Findings
When the estimated memory usage stays below 80% of GPU memory capacity, training avoids out-of-memory errors.
The formulas effectively predict memory overflow, guiding optimal parallelization configurations.
Empirical analysis across 454 experiments on A100 and H100 GPUs yields insights into optimal 4D parallelism setups.
Abstract
In large language model (LLM) training, several parallelization strategies, including Tensor Parallelism (TP), Pipeline Parallelism (PP), Data Parallelism (DP), as well as Sequence Parallelism (SP) and Context Parallelism (CP), are employed to distribute model parameters, activations, and optimizer states across devices. Identifying the optimal parallelization configuration for each environment while avoiding GPU memory overflow remains a challenging task. In this study, we provide precise formulas to estimate the memory consumed by parameters, gradients, optimizer states, and activations for 4D parallel training (DP, TP, PP, CP) in the Llama architecture. We conducted 454 experiments on A100 and H100 GPUs, incorporating often neglected factors such as temporary buffers and memory fragmentation into our analysis. Results indicate that when the estimated memory usage is below 80% of the…
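To make the idea of a memory estimator concrete, here is a minimal sketch in Python. It is not the paper's formula set: it covers only model states (weights, gradients, optimizer states), ignores activations, temporary buffers, and fragmentation, and assumes the common mixed-precision Adam accounting of 16 bytes per parameter. The names `ParallelConfig`, `model_state_bytes`, and `fits_under_threshold` are hypothetical, as is the assumption that TP and PP shard parameters evenly.

```python
from dataclasses import dataclass

GIB = 1024 ** 3


@dataclass
class ParallelConfig:
    tp: int      # tensor-parallel size
    pp: int      # pipeline-parallel size
    dp: int      # data-parallel size
    cp: int = 1  # context-parallel size (unused here: it mainly affects activations)


def model_state_bytes(num_params: int, cfg: ParallelConfig,
                      shard_optimizer_over_dp: bool = False) -> float:
    """Rough per-GPU bytes for weights, gradients, and Adam optimizer states.

    Assumed accounting (not taken from the paper): 2 B bf16 weights,
    2 B bf16 gradients, and 12 B fp32 optimizer state (master weights,
    momentum, variance) per parameter.
    """
    # Assumption: TP and PP split the parameters evenly across devices.
    params_per_gpu = num_params / (cfg.tp * cfg.pp)
    weights_and_grads = params_per_gpu * (2 + 2)
    optimizer = params_per_gpu * 12
    if shard_optimizer_over_dp:
        # ZeRO-1 style sharding of optimizer states across data-parallel ranks.
        optimizer /= cfg.dp
    return weights_and_grads + optimizer


def fits_under_threshold(estimated_bytes: float, gpu_bytes: float,
                         threshold: float = 0.8) -> bool:
    """Apply the 80% rule of thumb reported in the findings."""
    return estimated_bytes < threshold * gpu_bytes


if __name__ == "__main__":
    cfg = ParallelConfig(tp=2, pp=2, dp=4)
    est = model_state_bytes(7_000_000_000, cfg)
    print(f"estimated model states: {est / GIB:.1f} GiB per GPU")
    print("under 80% of an 80 GiB GPU:", fits_under_threshold(est, 80 * GIB))
```

Because activations are left out, a configuration that passes this check can still overflow in practice; the sketch only illustrates how a per-GPU estimate and the 80% threshold combine to screen candidate configurations before launching a run.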
Taxonomy
Topics: Topic Modeling
Methods: Fragmentation · LLaMA
