Elastic deep learning in multi-tenant GPU cluster

Yidi Wu; Kaihao Ma; Xiao Yan; Zhi Liu; Zhenkun Cai; Yuzhen Huang; James Cheng; Han Yuan; Fan Yu

arXiv:1909.11985·cs.DC·December 3, 2019·1 cites

Open Access

TL;DR

This paper introduces EDL, a low-overhead elastic deep learning framework that dynamically adjusts GPU parallelism to improve cluster efficiency and job performance.

Contribution

The paper presents EDL, which uses techniques such as stop-free scaling and a dynamic data pipeline to make elasticity efficient in multi-tenant GPU clusters.

Findings

01 EDL reduces the overhead of parallelism adjustments

02 EDL improves GPU cluster utilization

03 EDL benefits scheduling and resource management

Abstract

We study how to support elasticity, i.e., the ability to dynamically adjust the parallelism (number of GPUs), for deep neural network (DNN) training. Elasticity can benefit multi-tenant GPU cluster management in many ways, e.g., achieving various scheduling objectives (e.g., job throughput, job completion time, GPU efficiency) according to cluster load variations, maximizing the use of transient idle resources, performance profiling, job migration, and straggler mitigation. However, existing parallelism adjustment strategies incur high overheads, which hinder many applications from making effective use of elasticity. We propose EDL to enable low-overhead elastic deep learning with a simple API. We present techniques that are necessary to reduce the overhead of parallelism adjustments, such as stop-free scaling and dynamic data pipeline. We also demonstrate that EDL can indeed bring…
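To make the idea of elasticity concrete, here is a toy Python sketch of on-demand data assignment while the number of workers changes mid-epoch. It is only an illustration under stated assumptions: `ChunkDispenser` and `run_epoch` are hypothetical names, workers are simulated in a single process, and this is not EDL's API or its actual dynamic data pipeline, whose details are not given on this page.

```python
from collections import deque


class ChunkDispenser:
    """Hands out fixed-size chunks of sample indices on demand (hypothetical helper)."""

    def __init__(self, num_samples, chunk_size):
        # Pre-split the epoch into chunks; nothing is bound to a worker yet.
        self.pending = deque(
            range(start, min(start + chunk_size, num_samples))
            for start in range(0, num_samples, chunk_size)
        )

    def next_chunk(self):
        # Return the next unassigned chunk, or None when the epoch is done.
        return self.pending.popleft() if self.pending else None


def run_epoch(num_samples, chunk_size, workers_per_round):
    """Simulate one epoch in which the worker count changes between rounds.

    workers_per_round lists the worker count for each round, e.g. [2, 4, 1];
    the last count is reused until the data is exhausted.
    Returns a dict mapping worker id -> list of sample indices processed.
    """
    if not workers_per_round or min(workers_per_round) < 1:
        raise ValueError("every round needs at least one worker")
    dispenser = ChunkDispenser(num_samples, chunk_size)
    processed = {}
    round_idx = 0
    while dispenser.pending:
        n_workers = workers_per_round[min(round_idx, len(workers_per_round) - 1)]
        for worker in range(n_workers):
            chunk = dispenser.next_chunk()
            if chunk is None:
                break
            processed.setdefault(worker, []).extend(chunk)
        round_idx += 1
    return processed


if __name__ == "__main__":
    # Scale from 2 workers to 4 between the first and second round.
    print(run_epoch(num_samples=10, chunk_size=2, workers_per_round=[2, 4, 1]))
```

Because chunks are assigned only when a worker asks for one, changing the worker count never requires repartitioning the dataset, and each sample is still processed exactly once per epoch in this simulation.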

Taxonomy

Topics: Cloud Computing and Resource Management · Stochastic Gradient Optimization Techniques · Advanced Data Storage Technologies