ScaleDL: Towards Scalable and Efficient Runtime Prediction for Distributed Deep Learning Workloads

Xiaokai Wang, Shaoyuan Huang, Yuting Li, and Xiaofei Wang

arXiv:2511.04162 · cs.LG · November 14, 2025

Open Access

TL;DR

ScaleDL is a new framework that combines nonlinear layer-wise modeling with graph neural networks to accurately predict DNN runtimes across architectures while reducing data collection costs.

Contribution

ScaleDL integrates GNNs with D-optimal data sampling to improve prediction accuracy and generalizability for distributed deep learning workloads.
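To make the idea of D-optimal sampling concrete, here is a minimal sketch of a generic greedy D-optimal selection. It is not taken from the paper, and this page does not describe how ScaleDL applies the technique. The function name `greedy_d_optimal`, the `ridge` regularizer, and the toy feature vectors are all assumptions for illustration. The sketch picks the profiling configurations that most increase the determinant of the information matrix, using the rank-one identity det(M + xxᵀ) = det(M)(1 + xᵀM⁻¹x) and a Sherman-Morrison update of M⁻¹.

```python
def greedy_d_optimal(candidates, k, ridge=1e-6):
    """Greedily pick k rows of `candidates` that maximize det(X^T X).

    Hypothetical helper: `candidates` is a list of equal-length feature
    vectors, one per configuration that could be profiled.
    """
    if k > len(candidates):
        raise ValueError("k must not exceed the number of candidates")
    d = len(candidates[0])
    # Inverse of the regularized information matrix M = ridge * I.
    inv = [[(1.0 / ridge if r == c else 0.0) for c in range(d)]
           for r in range(d)]
    chosen = []
    for _ in range(k):
        best_idx, best_gain, best_mx = None, -1.0, None
        for i, x in enumerate(candidates):
            if i in chosen:
                continue
            # mx = M^{-1} x; gain = x^T M^{-1} x, the relative growth of
            # det(M) if x is added to the design.
            mx = [sum(inv[r][c] * x[c] for c in range(d)) for r in range(d)]
            gain = sum(x[r] * mx[r] for r in range(d))
            if gain > best_gain:
                best_idx, best_gain, best_mx = i, gain, mx
        chosen.append(best_idx)
        # Sherman-Morrison update of M^{-1} after adding x x^T to M.
        scale = 1.0 + best_gain
        inv = [[inv[r][c] - best_mx[r] * best_mx[c] / scale
                for c in range(d)] for r in range(d)]
    return chosen


# Toy example: two nearly collinear configurations and one orthogonal one.
configs = [[1.0, 0.0], [1.0, 0.01], [0.0, 1.0]]
picked = greedy_d_optimal(configs, k=2)
```

In the toy example the two near-duplicate configurations add little information about each other, so the greedy rule selects only one of them together with the orthogonal one. That is the sense in which D-optimal sampling can cut the number of measurements needed.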

Findings

01. Achieves 6x lower mean relative error (MRE) than baselines

02. Achieves 5x lower root mean squared error (RMSE) than baselines

03. Demonstrates effectiveness across five DNN models
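For readers unfamiliar with the two error metrics in these findings, the sketch below computes them under their standard definitions. The paper's exact formulas are not shown on this page, so the definitions are an assumption. The runtimes are made-up illustrative values, not results from the paper.

```python
import math


def mean_relative_error(actual, predicted):
    """Average of |predicted - actual| / actual over all samples."""
    return sum(abs(p - a) / a for a, p in zip(actual, predicted)) / len(actual)


def root_mean_squared_error(actual, predicted):
    """Square root of the mean squared difference."""
    return math.sqrt(
        sum((p - a) ** 2 for a, p in zip(actual, predicted)) / len(actual)
    )


# Hypothetical measured vs. predicted runtimes in milliseconds.
actual = [100.0, 200.0, 400.0]
predicted = [110.0, 180.0, 400.0]

mre = mean_relative_error(actual, predicted)
rmse = root_mean_squared_error(actual, predicted)
```

MRE is scale-free, so short and long workloads count equally. RMSE is in the same unit as the runtime and penalizes large absolute misses more heavily, which is why the two are often reported together.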

Abstract

Deep neural networks (DNNs) form the cornerstone of modern AI services, supporting a wide range of applications, including autonomous driving, chatbots, and recommendation systems. As models increase in size and complexity, DNN workloads such as training and inference tasks impose unprecedented demands on distributed computing resources, making accurate runtime prediction essential for optimizing development and resource allocation. Traditional methods rely on additive computational unit models, limiting their accuracy and generalizability. In contrast, graph-enhanced modeling improves performance but significantly increases data collection costs. Therefore, there is a critical need for a method that strikes a balance between accuracy, generalizability, and data collection costs. To address these challenges, we propose ScaleDL, a novel runtime prediction framework that combines…

Taxonomy

Topics: Advanced Neural Network Applications · Big Data and Digital Economy · Domain Adaptation and Few-Shot Learning