Online Evolutionary Batch Size Orchestration for Scheduling Deep Learning Workloads in GPU Clusters

Zhengda Bian; Shenggui Li; Wei Wang; Yang You

arXiv:2108.03645 · cs.DC · August 10, 2021 · 1 citation

TL;DR

This paper introduces ONES, an online evolutionary scheduler that dynamically adjusts batch sizes for deep learning jobs in GPU clusters, significantly improving resource utilization and reducing job completion times.

Contribution

The paper presents a novel online evolutionary approach for elastic batch size management, enhancing GPU scheduling efficiency over static policies.

Findings

01 Outperforms prior schedulers in average job completion time

02 Improves GPU utilization through dynamic batch size optimization

03 Demonstrated effectiveness on 64 GPUs in a supercomputing environment

Abstract

Efficient GPU resource scheduling is essential to maximize resource utilization and save training costs for the increasing amount of deep learning workloads in shared GPU clusters. Existing GPU schedulers largely rely on static policies to leverage the performance characteristics of deep learning jobs. However, they can hardly reach optimal efficiency due to the lack of elasticity. To address the problem, we propose ONES, an ONline Evolutionary Scheduler for elastic batch size orchestration. ONES automatically manages the elasticity of each job based on the training batch size, so as to maximize GPU utilization and improve scheduling efficiency. It determines the batch size for each job through an online evolutionary search that can continuously optimize the scheduling decisions. We evaluate the effectiveness of ONES with 64 GPUs on TACC's Longhorn supercomputers. The results show that…
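
To make the idea of an evolutionary search over per-job batch sizes concrete, here is a minimal sketch. It is not the ONES algorithm: the saturating throughput model, the single shared batch capacity, the mutation operator, and every name in it (`throughput`, `fitness`, `mutate`, `evolve`) are assumptions made for illustration, and it runs as an offline loop rather than reacting to jobs arriving online.

```python
import random


def throughput(batch_size, half_sat=256.0):
    # Hypothetical saturating model: larger batches help, with diminishing returns.
    return batch_size / (batch_size + half_sat)


def fitness(candidate, capacity):
    # A candidate assigns one batch size per job; exceeding capacity is infeasible.
    if sum(candidate) > capacity:
        return float("-inf")
    return sum(throughput(b) for b in candidate)


def mutate(candidate, choices, rng):
    # Change the batch size of one randomly chosen job.
    child = list(candidate)
    child[rng.randrange(len(child))] = rng.choice(choices)
    return child


def evolve(num_jobs, capacity, choices=(32, 64, 128, 256, 512),
           pop_size=16, generations=50, seed=0):
    if num_jobs * min(choices) > capacity:
        raise ValueError("capacity too small for the smallest batch size")
    rng = random.Random(seed)
    # Start from a feasible population: every job at the smallest batch size.
    population = [[min(choices)] * num_jobs for _ in range(pop_size)]
    for _ in range(generations):
        children = [mutate(p, choices, rng) for p in population]
        # Elitist selection: keep the best pop_size of parents plus children.
        merged = sorted(population + children,
                        key=lambda c: fitness(c, capacity), reverse=True)
        population = merged[:pop_size]
    return population[0]


if __name__ == "__main__":
    best = evolve(num_jobs=4, capacity=1024)
    print(best, sum(best))
```

Because selection always keeps the best of parents and children, the returned assignment is never worse than the starting one under this toy fitness. A real scheduler would replace `throughput` with measured job performance and re-run the search as jobs arrive and finish.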

Taxonomy

Topics: Cloud Computing and Resource Management · Parallel Computing and Optimization Techniques · IoT and Edge/Fog Computing