ScaleDL: Towards Scalable and Efficient Runtime Prediction for Distributed Deep Learning Workloads
Xiaokai Wang, Shaoyuan Huang, Yuting Li, and Xiaofei Wang

TL;DR
ScaleDL is a new framework that combines nonlinear layer-wise modeling with graph neural networks to accurately predict DNN runtimes across architectures while reducing data collection costs.
Contribution
ScaleDL integrates GNN-based modeling with D-optimal data sampling to improve prediction accuracy and generalizability for distributed deep learning workloads.
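To make the term "D-optimal" concrete: a D-optimal design picks the configurations to measure so that the determinant of the information matrix X^T X is as large as possible, which tends to shrink the uncertainty of a fitted model for a fixed measurement budget. The sketch below is a generic greedy heuristic for that criterion, not the sampling procedure used in ScaleDL, which this summary does not describe. The function name `greedy_d_optimal`, the `ridge` parameter, and the toy feature matrix are all assumptions made for illustration.

```python
import numpy as np


def greedy_d_optimal(candidates, k, ridge=1e-6):
    """Greedily choose k rows of `candidates` to maximize log det(X^T X).

    Hypothetical helper for illustration only.
    `candidates` is an (n, d) array: one feature row per candidate configuration.
    `ridge` keeps the information matrix invertible before d rows are chosen.
    """
    candidates = np.asarray(candidates, dtype=float)
    n, d = candidates.shape
    info = ridge * np.eye(d)  # running information matrix X^T X (+ ridge)
    selected = []
    for _ in range(k):
        best_idx, best_logdet = None, -np.inf
        for i in range(n):
            if i in selected:
                continue
            x = candidates[i]
            # Log-determinant of the information matrix if row i were added.
            sign, logdet = np.linalg.slogdet(info + np.outer(x, x))
            if sign > 0 and logdet > best_logdet:
                best_idx, best_logdet = i, logdet
        selected.append(best_idx)
        x = candidates[best_idx]
        info = info + np.outer(x, x)
    return selected


# Toy usage: features [1, x] for a straight-line fit over x = 0..10.
toy = np.column_stack([np.ones(11), np.arange(11.0)])
picked = greedy_d_optimal(toy, k=2)
```

For a straight-line model, the two most informative measurements sit at the ends of the range, so `picked` contains the indices for x = 0 and x = 10. A greedy search is not guaranteed to find the globally optimal design; exchange algorithms are a common refinement.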
Findings
Achieves 6x lower mean relative error (MRE) than baselines
Achieves 5x lower root mean squared error (RMSE) than baselines
Demonstrates effectiveness across five DNN models
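For readers unfamiliar with the two metrics named above, the following is a minimal sketch using their common textbook definitions: MRE as the mean of |predicted - actual| / |actual|, and RMSE as the square root of the mean squared error. The paper's exact formulas are not given in this summary, so treat these definitions as an assumption. The function names and the sample runtimes are hypothetical.

```python
import numpy as np


def mean_relative_error(y_true, y_pred):
    """Mean of |pred - true| / |true|; assumes no true value is zero."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean(np.abs(y_pred - y_true) / np.abs(y_true)))


def root_mean_squared_error(y_true, y_pred):
    """Square root of the mean squared difference, in the units of y_true."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_pred - y_true) ** 2)))


# Made-up runtimes in milliseconds, purely for illustration.
measured = [100.0, 200.0]
predicted = [110.0, 180.0]
mre = mean_relative_error(measured, predicted)       # each prediction is off by 10%
rmse = root_mean_squared_error(measured, predicted)  # sqrt((10**2 + 20**2) / 2)
```

MRE is scale-free, so it weights short and long workloads equally, while RMSE is dominated by the largest absolute errors. Reporting both gives a fuller picture of predictor quality.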
Abstract
Deep neural networks (DNNs) form the cornerstone of modern AI services, supporting a wide range of applications, including autonomous driving, chatbots, and recommendation systems. As models increase in size and complexity, DNN workloads such as training and inference tasks impose unprecedented demands on distributed computing resources, making accurate runtime prediction essential for optimizing development and resource allocation. Traditional methods rely on additive computational unit models, limiting their accuracy and generalizability. In contrast, graph-enhanced modeling improves performance but significantly increases data collection costs. Therefore, there is a critical need for a method that strikes a balance between accuracy, generalizability, and data collection costs. To address these challenges, we propose ScaleDL, a novel runtime prediction framework that combines…
Taxonomy
Topics: Advanced Neural Network Applications · Big Data and Digital Economy · Domain Adaptation and Few-Shot Learning
