Effective Elastic Scaling of Deep Learning Workloads
Vaibhav Saxena, K. R. Jayaram, Saurav Basu, Yogish Sabharwal, and Ashish Verma

TL;DR
This paper introduces a novel elastic scaling strategy for deep learning workloads that dynamically adjusts batch sizes and resources, significantly improving training efficiency and cluster utilization.
Contribution
It proposes a real-time optimizer for dynamic batch size and resource allocation, enhancing deep learning training performance on large-scale platforms.
Findings
Up to 2x more jobs completed compared to baseline.
Average job completion time reduced by up to 10x.
Effective resource utilization and faster training times.
Abstract
The increased use of deep learning (DL) in academia, government and industry has, in turn, led to the popularity of on-premise and cloud-hosted deep learning platforms, whose goals are to enable organizations to utilize expensive resources effectively, and to share said resources among multiple teams in a fair and effective manner. In this paper, we examine the elastic scaling of DL jobs over large-scale training platforms and propose a novel resource allocation strategy for DL training jobs, resulting in improved job run time performance as well as increased cluster utilization. We begin by analyzing DL workloads and exploit the fact that DL jobs can be run with a range of batch sizes without affecting their final accuracy. We formulate an optimization problem that explores a dynamic batch size allocation to individual DL jobs based on their scaling efficiency, when…
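To make the idea of allocating resources by scaling efficiency concrete, here is a minimal Python sketch. It is not the optimizer from the paper, whose formulation is not visible in the truncated abstract; it is a simple greedy heuristic swapped in for illustration. The function name `allocate_gpus`, the job names, and the throughput numbers are all hypothetical. Each job is assumed to come with a table giving its throughput (samples per second) at 1, 2, 3, ... GPUs, using the best batch size within its allowed range at each GPU count.

```python
import heapq


def allocate_gpus(jobs, total_gpus):
    """Greedily hand out GPUs one at a time to the job with the largest
    marginal throughput gain.

    jobs: dict mapping job name -> list of throughputs, where entry k is the
          (assumed, hypothetical) samples/sec achieved with k + 1 GPUs.
    total_gpus: number of GPUs available in the cluster.

    Returns a dict mapping job name -> number of GPUs assigned.
    This is an illustrative heuristic, not the paper's optimizer; it is only
    optimal when each throughput curve has diminishing returns.
    """
    alloc = {name: 0 for name in jobs}

    # Max-heap (via negated gains) of the gain from each job's next GPU.
    heap = [(-tput[0], name) for name, tput in jobs.items() if tput]
    heapq.heapify(heap)

    remaining = total_gpus
    while remaining > 0 and heap:
        _, name = heapq.heappop(heap)
        alloc[name] += 1
        remaining -= 1

        k = alloc[name]
        tput = jobs[name]
        if k < len(tput):
            gain = tput[k] - tput[k - 1]
            if gain > 0:  # stop scaling a job once extra GPUs no longer help
                heapq.heappush(heap, (-gain, name))
    return alloc


# Hypothetical throughput tables: job "a" scales well, job "b" scales poorly.
example_jobs = {
    "a": [100, 190, 260, 300],
    "b": [80, 100, 110, 115],
}
print(allocate_gpus(example_jobs, total_gpus=4))
```

In this sketch the job that scales well receives most of the GPUs, while the poorly scaling job still gets one because its first GPU yields a large gain. Rerunning the allocation whenever jobs arrive or finish would give a crude form of elasticity under these assumptions.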
