Analysis of Large-Scale Multi-Tenant GPU Clusters for DNN Training Workloads

Myeongjae Jeon; Shivaram Venkataraman; Amar Phanishayee; Junjie Qian; Wencong Xiao; Fan Yang

arXiv:1901.05758·cs.DC·August 9, 2019·67 cites

TL;DR

This paper analyzes the unique challenges of scheduling and resource utilization in large multi-tenant GPU clusters for deep learning workloads, providing insights and guidelines for improving cluster efficiency.

Contribution

It offers a detailed workload characterization of enterprise GPU clusters and proposes design guidelines for next-generation schedulers tailored to DNN training workloads.

Findings

01

Gang scheduling and locality constraints impact queuing delays.

02

Locality significantly affects GPU utilization.

03

Failures during training influence overall cluster efficiency.
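
Finding 01 can be illustrated with a toy sketch. The function below is a hypothetical illustration, not the paper's scheduler or any real cluster API: `gang_place`, its parameters, and the server names are all assumptions made for this example. It shows all-or-nothing (gang) placement, and how a strict locality requirement can leave a job waiting even when enough GPUs are free cluster-wide.

```python
from typing import Dict, Optional


def gang_place(free: Dict[str, int], demand: int,
               single_server: bool) -> Optional[Dict[str, int]]:
    """Toy all-or-nothing placement (hypothetical helper).

    free:          free GPU count per server
    demand:        number of GPUs the job needs at the same time
    single_server: if True, require all GPUs on one server (strict locality)

    Returns the GPUs taken per server, or None if the job has to wait.
    """
    if single_server:
        # Strict locality: the whole gang must fit on a single server.
        for server, n in sorted(free.items()):
            if n >= demand:
                return {server: demand}
        return None

    # Relaxed locality: GPUs may span servers, but still all or nothing.
    if sum(free.values()) < demand:
        return None

    placement: Dict[str, int] = {}
    remaining = demand
    # Fill from the servers with the most free GPUs to limit spreading.
    for server, n in sorted(free.items(), key=lambda kv: -kv[1]):
        if remaining == 0:
            break
        take = min(n, remaining)
        if take > 0:
            placement[server] = take
            remaining -= take
    return placement


if __name__ == "__main__":
    free_gpus = {"s1": 2, "s2": 3, "s3": 1}
    # Six GPUs are free in total, yet no single server can host a 4-GPU job.
    print(gang_place(free_gpus, 4, single_server=True))
    # Relaxing locality lets the job start, at the cost of spanning servers.
    print(gang_place(free_gpus, 4, single_server=False))
```

In this sketch the strict-locality call returns `None` (the job queues), while the relaxed call places the job across two servers. That is the trade-off the findings point to: waiting for locality lengthens queuing delay, and relaxing it can affect GPU utilization.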

Abstract

With widespread advances in machine learning, a number of large enterprises are beginning to incorporate machine learning models across a number of products. These models are typically trained on shared, multi-tenant GPU clusters. Similar to existing cluster computing workloads, scheduling frameworks aim to provide features like high efficiency, resource isolation, fair sharing across users, etc. However, Deep Neural Network (DNN) based workloads, predominantly trained on GPUs, differ in two significant ways from traditional big data analytics workloads. First, from a cluster utilization perspective, GPUs represent a monolithic resource that cannot be shared at a fine granularity across users. Second, from a workload perspective, deep learning frameworks require gang scheduling, reducing the flexibility of scheduling and making the jobs themselves inelastic to failures at runtime. In this…

Code & Models

Repositories

msr-fiddle/philly-traces (Official)

Taxonomy

Topics: Cloud Computing and Resource Management · IoT and Edge/Fog Computing · Stochastic Gradient Optimization Techniques