Resource Allocation and Workload Scheduling for Large-Scale Distributed Deep Learning: A Survey
Feng Liang, Zhen Zhang, Haifeng Lu, Chengming Li, Victor C. M. Leung, Yanyi Guo, Xiping Hu

TL;DR
This survey reviews recent strategies and challenges in resource allocation and workload scheduling for large-scale distributed deep learning, highlighting key insights and practical case studies to guide future research.
Contribution
It provides a comprehensive overview of recent advances in resource management and scheduling strategies specifically for large-scale distributed deep learning environments.
Findings
Identifies key challenges such as scheduling complexity, resource and workload heterogeneity, and fault tolerance.
Highlights effective strategies for resource allocation and workload scheduling.
Includes a case study on training large language models.
Abstract
With rapidly increasing distributed deep learning workloads in large-scale data centers, efficient distributed deep learning framework strategies for resource allocation and workload scheduling have become the key to high-performance deep learning. The large-scale environment with large volumes of datasets, models, and computational and communication resources raises various unique challenges for resource allocation and workload scheduling in distributed deep learning, such as scheduling complexity, resource and workload heterogeneity, and fault tolerance. To uncover these challenges and corresponding solutions, this survey reviews the literature, mainly from 2019 to 2024, on efficient resource allocation and workload scheduling strategies for large-scale distributed DL. We explore these strategies by focusing on various resource types, scheduling granularity levels, and performance…
Taxonomy
Topics: Distributed and Parallel Computing Systems · Cloud Computing and Resource Management · IoT and Edge/Fog Computing
