Simulating LLM training workloads for heterogeneous compute and network infrastructure
Sumit Kumar, Arjun Temura, Naman Sharma, Ramanjeet Singh, Meet Dadhania, Praveen Tammana, Satananda Burla, Abed Mohammad Kamaluddin, Rinku Shah

TL;DR
This paper introduces a heterogeneity-aware LLM training simulator that predicts training time on real-world GPU clusters with mixed compute and network hardware, removing the homogeneous-infrastructure assumption made by existing simulators.
Contribution
The paper presents the design and initial implementation of a distributed LLM training simulator that models heterogeneity in both compute devices and the network, making its predictions more representative of real deployments.
Findings
Device heterogeneity significantly affects the individual components of training time, such as computation and communication.
The simulator can model custom device group configurations.
Initial results show accurate prediction of computation and communication times.
Abstract
The growing demand for large-scale GPU clusters in distributed model training presents a significant barrier to innovation, particularly in model optimization, performance tuning, and system-level enhancements. To address this challenge, LLM training simulators are employed to estimate training time and guide design decisions. However, the state-of-the-art LLM training simulators assume homogeneous compute and network infrastructure. In practice, device heterogeneity is inevitable due to resource sharing in cloud environments, frequent shifts in device generations, and inherent intra-chip interconnect heterogeneity. To address the gap between state-of-the-art and practical requirements, we propose the design of a heterogeneity-aware distributed LLM simulator capable of predicting training time while enabling abstractions to specify custom configurations for device groups and…
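To give a rough sense of why a homogeneous assumption can mislead, here is a minimal sketch of a step-time estimate for synchronous data-parallel training over device groups with different speeds. It is an illustration under simplifying assumptions, not the paper's simulator: the `DeviceGroup` class, its fields, and the three functions are hypothetical names, compute time is taken as FLOPs divided by sustained throughput, and communication uses the standard bandwidth term of a ring all-reduce limited by the slowest link, with no overlap between compute and communication.

```python
from dataclasses import dataclass


@dataclass
class DeviceGroup:
    """Hypothetical description of a group of identical devices."""
    name: str
    num_devices: int
    tflops: float     # sustained compute throughput per device, TFLOP/s
    link_gbps: float  # per-device network bandwidth, Gbit/s


def compute_time_s(group: DeviceGroup, flops_per_device: float) -> float:
    """Time for one device in the group to finish its share of the work."""
    return flops_per_device / (group.tflops * 1e12)


def allreduce_time_s(groups: list[DeviceGroup], grad_bytes: float) -> float:
    """Bandwidth term of a ring all-reduce, limited by the slowest link."""
    n = sum(g.num_devices for g in groups)
    if n < 2:
        return 0.0  # a single device has nothing to synchronize
    slowest_bytes_per_s = min(g.link_gbps for g in groups) * 1e9 / 8
    return 2 * (n - 1) / n * grad_bytes / slowest_bytes_per_s


def step_time_s(groups: list[DeviceGroup], flops_per_device: float,
                grad_bytes: float) -> float:
    """Synchronous step: wait for the slowest group, then all-reduce."""
    compute = max(compute_time_s(g, flops_per_device) for g in groups)
    return compute + allreduce_time_s(groups, grad_bytes)


# Illustrative numbers only; they are assumptions, not measurements.
groups = [
    DeviceGroup("fast", num_devices=4, tflops=300.0, link_gbps=400.0),
    DeviceGroup("slow", num_devices=4, tflops=100.0, link_gbps=100.0),
]
print(f"estimated step time: {step_time_s(groups, 3e14, 1e10):.2f} s")
```

In this toy model the slower group sets both the compute time and the all-reduce time, so treating every device as identical to the fast group would underestimate the step time.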
