Towards Universal Performance Modeling for Machine Learning Training on Multi-GPU Platforms
Zhongyi Lin, Ning Sun, Pallab Bhattacharya, Xizhou Feng, Louis Feng, John D. Owens

TL;DR
This paper presents a comprehensive performance modeling framework for multi-GPU machine learning training, accurately predicting training times and aiding configuration choices across diverse workloads and hardware setups.
Contribution
It introduces a data-distribution-aware performance model and communication prediction techniques, extending prior single-GPU work to multi-GPU environments with high accuracy.
Findings
Predicts per-iteration training time with 5.21% error for DLRM models.
Generalizes well to Transformer-based NLP models with 3.00% error.
Enables quick selection of optimal embedding sharding configurations with an 85% success rate.
Abstract
Characterizing and predicting the training performance of modern machine learning (ML) workloads on compute systems with compute and communication spread between CPUs, GPUs, and network devices is not only the key to optimization and planning but also a complex goal to achieve. The primary challenges include the complexity of synchronization and load balancing between CPUs and GPUs, the variance in input data distribution, and the use of different communication devices and topologies (e.g., NVLink, PCIe, network cards) that connect multiple compute devices, coupled with the desire for flexible training configurations. Built on top of our prior work for single-GPU platforms, we address these challenges and enable multi-GPU performance modeling by incorporating (1) data-distribution-aware performance models for embedding table lookup, and (2) data movement prediction of communication…
Taxonomy
Topics: Cloud Computing and Resource Management · Online Learning and Analytics · Distributed and Parallel Computing Systems
