Performance Modeling and Workload Analysis of Distributed Large Language Model Training and Inference
Joyjit Kundu, Wenzhe Guo, Ali BanaGozar, Udari De Alwis, Sourav Sengupta, Puneet Gupta, Arindam Mallik

TL;DR
This paper presents a comprehensive performance modeling framework for distributed large language model training and inference, analyzing hardware, parallelization strategies, and technology scaling impacts to guide future system design.
Contribution
It introduces a general analytical framework that accurately predicts performance considering compute, memory, network, and parallelization, validated with industry data and applied to technology scaling analysis.
Findings
Performance bottlenecks evolve with technology scaling.
Memory footprint and activation re-computation significantly impact training speed.
DRAM scaling influences inference latency.
Abstract
Aligning future system design with the ever-increasing compute needs of large language models (LLMs) is undoubtedly an important problem in today's world. Here, we propose a general performance modeling methodology and workload analysis of distributed LLM training and inference through an analytical framework that accurately considers compute, memory sub-system, network, and various parallelization strategies (model parallel, data parallel, pipeline parallel, and sequence parallel). We validate our performance predictions with published data from literature and relevant industry vendors (e.g., NVIDIA). For distributed training, we investigate the memory footprint of LLMs for different activation re-computation methods, dissect the key factors behind the massive performance gain from A100 to B200 (35x speed-up closely following NVIDIA's scaling trend), and further run a design…
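To make the idea of an analytical performance model concrete, here is a minimal roofline-style sketch in Python. It is not the paper's framework: it only bounds the time of one matrix multiplication by the slower of compute and memory traffic, and it ignores network and parallelization. The names `Device`, `gemm_cost`, and `roofline_time` are hypothetical, and the device numbers are illustrative placeholders, not the specifications of any real GPU.

```python
from dataclasses import dataclass


@dataclass
class Device:
    """Hypothetical accelerator described by two peak rates."""
    peak_flops: float     # peak compute throughput, FLOP/s
    mem_bandwidth: float  # peak memory bandwidth, bytes/s


def gemm_cost(m: int, n: int, k: int, bytes_per_elem: int = 2):
    """FLOPs and bytes moved for C[m, n] = A[m, k] @ B[k, n].

    Assumes each matrix is read or written exactly once
    (no cache reuse modeling) and 2-byte elements by default.
    """
    flops = 2 * m * n * k
    bytes_moved = bytes_per_elem * (m * k + k * n + m * n)
    return flops, bytes_moved


def roofline_time(flops: float, bytes_moved: float, dev: Device) -> float:
    """Lower-bound execution time: the slower of compute and memory."""
    compute_time = flops / dev.peak_flops
    memory_time = bytes_moved / dev.mem_bandwidth
    return max(compute_time, memory_time)


# Illustrative placeholder device: 100 TFLOP/s, 1 TB/s.
dev = Device(peak_flops=100e12, mem_bandwidth=1e12)

# A single-token (m = 1) projection, as in autoregressive decoding.
decode_flops, decode_bytes = gemm_cost(1, 4096, 4096)
decode_time = roofline_time(decode_flops, decode_bytes, dev)

# A large-batch (m = 4096) projection, as in training or prefill.
train_flops, train_bytes = gemm_cost(4096, 4096, 4096)
train_time = roofline_time(train_flops, train_bytes, dev)
```

Under these assumed numbers the single-token case is limited by memory bandwidth and the large-batch case by compute, which is one simple way to see why memory scaling can matter for inference latency while compute scaling matters more for training throughput.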
Taxonomy
Topics: Topic Modeling
