Effective Elastic Scaling of Deep Learning Workloads

Vaibhav Saxena, K. R. Jayaram, Saurav Basu, Yogish Sabharwal, and Ashish Verma

arXiv:2006.13878 · cs.DC · June 25, 2020

TL;DR

This paper introduces a novel elastic scaling strategy for deep learning workloads that dynamically adjusts batch sizes and resources, significantly improving training efficiency and cluster utilization.

Contribution

It proposes a real-time optimizer for dynamic batch size and resource allocation, enhancing deep learning training performance on large-scale platforms.

Findings

01. Up to 2x more jobs completed compared to baseline.

02. Average job completion time reduced by up to 10x.

03. Effective resource utilization and faster training times.

Abstract

The increased use of deep learning (DL) in academia, government and industry has, in turn, led to the popularity of on-premise and cloud-hosted deep learning platforms, whose goals are to enable organizations to utilize expensive resources effectively and to share those resources among multiple teams in a fair and effective manner. In this paper, we examine the elastic scaling of DL jobs over large-scale training platforms and propose a novel resource allocation strategy for DL training jobs, resulting in improved job run time performance as well as increased cluster utilization. We begin by analyzing DL workloads and exploit the fact that DL jobs can be run with a range of batch sizes without affecting their final accuracy. We formulate an optimization problem that explores a dynamic batch size allocation to individual DL jobs based on their scaling efficiency, when…
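To make the idea in the abstract concrete, the following is a minimal sketch of allocating GPUs to jobs according to how well each job scales. It is not the paper's optimizer, whose formulation is cut off in the abstract above. The `Job` fields, the toy `speedup` cost model, and the greedy `allocate` rule are all assumptions chosen for illustration. The sketch also assumes every job can run on a single GPU and that a job's batch size grows with its GPU count up to a per-job maximum.

```python
from dataclasses import dataclass


@dataclass
class Job:
    name: str
    max_batch: int      # largest batch size the job tolerates (assumed known)
    per_gpu_batch: int  # largest batch that fits on one GPU (assumed known)
    comm_cost: float    # hypothetical sync overhead added per extra GPU


def batch_size(job: Job, gpus: int) -> int:
    """Batch size used on `gpus` devices: grow with GPUs, capped at max_batch."""
    return min(job.max_batch, gpus * job.per_gpu_batch)


def speedup(job: Job, gpus: int) -> float:
    """Toy throughput on `gpus` devices relative to one GPU (illustrative model)."""
    if gpus == 0:
        return 0.0
    batch = batch_size(job, gpus)
    # Compute time shrinks as the batch is split across GPUs,
    # while synchronization overhead grows with the GPU count.
    step_time = batch / gpus + job.comm_cost * (gpus - 1)
    return batch / step_time


def allocate(jobs: list[Job], total_gpus: int) -> dict[str, tuple[int, int]]:
    """Greedily hand each GPU to the job with the largest marginal speedup.

    Returns {job name: (gpus, batch size)}. GPUs that would not improve
    any job's speedup are left unallocated.
    """
    gpus = {job.name: 0 for job in jobs}
    for _ in range(total_gpus):
        best, best_gain = None, 0.0
        for job in jobs:
            g = gpus[job.name]
            gain = speedup(job, g + 1) - speedup(job, g)
            if gain > best_gain:
                best, best_gain = job, gain
        if best is None:  # no job benefits from another GPU
            break
        gpus[best.name] += 1
    return {job.name: (gpus[job.name], batch_size(job, gpus[job.name])) for job in jobs}


if __name__ == "__main__":
    jobs = [
        Job("A", max_batch=256, per_gpu_batch=64, comm_cost=2.0),
        Job("B", max_batch=64, per_gpu_batch=64, comm_cost=8.0),
    ]
    print(allocate(jobs, total_gpus=4))
```

In this toy setup, job A can raise its batch size as it gains GPUs and therefore keeps scaling, while job B is capped at a small batch and pays a higher synchronization cost, so the greedy rule gives most of the GPUs to A. A real scheduler would measure scaling efficiency rather than assume a closed-form model.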
