Analysis of Large-Scale Multi-Tenant GPU Clusters for DNN Training Workloads
Myeongjae Jeon, Shivaram Venkataraman, Amar Phanishayee, Junjie Qian, Wencong Xiao, Fan Yang

TL;DR
This paper analyzes the unique challenges of scheduling and resource utilization in large multi-tenant GPU clusters for deep learning workloads, providing insights and guidelines for improving cluster efficiency.
Contribution
It offers a detailed workload characterization of enterprise GPU clusters and proposes design guidelines for next-generation schedulers tailored to DNN training workloads.
Findings
Gang scheduling and locality constraints impact queuing delays.
Locality significantly affects GPU utilization.
Failures during training influence overall cluster efficiency.
Abstract
With widespread advances in machine learning, a number of large enterprises are beginning to incorporate machine learning models across a number of products. These models are typically trained on shared, multi-tenant GPU clusters. As with existing cluster computing workloads, scheduling frameworks aim to provide features such as high efficiency, resource isolation, and fair sharing across users. However, Deep Neural Network (DNN) based workloads, predominantly trained on GPUs, differ in two significant ways from traditional big data analytics workloads. First, from a cluster utilization perspective, GPUs represent a monolithic resource that cannot be shared at a fine granularity across users. Second, from a workload perspective, deep learning frameworks require gang scheduling, which reduces scheduling flexibility and makes the jobs themselves inelastic to failures at runtime. In this…
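The gang-scheduling and locality constraints mentioned in the abstract can be illustrated with a small sketch. This is not the scheduler studied in the paper: the function name `gang_schedule`, the `max_servers` locality knob, and the greedy "fewest servers first" placement are all assumptions made for illustration only.

```python
from typing import Dict, Optional


def gang_schedule(
    free_gpus: Dict[str, int],
    gpus_needed: int,
    max_servers: Optional[int] = None,
) -> Optional[Dict[str, int]]:
    """Hypothetical all-or-nothing placement.

    Returns {server: gpus_taken} if the whole job fits, otherwise None
    (the job keeps waiting in the queue). `max_servers` is an assumed
    locality constraint: the job may span at most that many servers.
    """
    # Visit servers with the most free GPUs first so the job spans few servers.
    candidates = sorted(free_gpus.items(), key=lambda kv: kv[1], reverse=True)

    placement: Dict[str, int] = {}
    remaining = gpus_needed
    for server, free in candidates:
        if remaining == 0:
            break
        if free == 0:
            continue
        if max_servers is not None and len(placement) >= max_servers:
            break  # locality constraint would be violated
        take = min(free, remaining)
        placement[server] = take
        remaining -= take

    if remaining > 0:
        # Gang constraint: never hand out a partial allocation.
        return None

    # Commit only after the full gang is known to fit.
    for server, take in placement.items():
        free_gpus[server] -= take
    return placement


if __name__ == "__main__":
    cluster = {"a": 4, "b": 2, "c": 2}
    # A 6-GPU job restricted to one server cannot start, even though
    # 8 GPUs are free in total; it would have to queue.
    print(gang_schedule(cluster, 6, max_servers=1))
    # Relaxing locality lets the same job start immediately.
    print(gang_schedule(cluster, 6))
```

The sketch shows why the two constraints interact with queuing: a job can be blocked while enough GPUs are free cluster-wide, simply because they are not free in the required shape.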
Taxonomy
Topics: Cloud Computing and Resource Management · IoT and Edge/Fog Computing · Stochastic Gradient Optimization Techniques
