Online Evolutionary Batch Size Orchestration for Scheduling Deep Learning Workloads in GPU Clusters
Zhengda Bian, Shenggui Li, Wei Wang, Yang You

TL;DR
This paper introduces ONES, an online evolutionary scheduler that dynamically adjusts batch sizes for deep learning jobs in GPU clusters, significantly improving resource utilization and reducing job completion times.
Contribution
The paper presents a novel online evolutionary approach for elastic batch size management, enhancing GPU scheduling efficiency over static policies.
Findings
Outperforms prior schedulers in average job completion time
Improves GPU utilization through dynamic batch size optimization
Evaluated with 64 GPUs on TACC's Longhorn supercomputers
Abstract
Efficient GPU resource scheduling is essential to maximize resource utilization and save training costs for the increasing amount of deep learning workloads in shared GPU clusters. Existing GPU schedulers largely rely on static policies to leverage the performance characteristics of deep learning jobs. However, they can hardly reach optimal efficiency due to the lack of elasticity. To address the problem, we propose ONES, an ONline Evolutionary Scheduler for elastic batch size orchestration. ONES automatically manages the elasticity of each job based on the training batch size, so as to maximize GPU utilization and improve scheduling efficiency. It determines the batch size for each job through an online evolutionary search that can continuously optimize the scheduling decisions. We evaluate the effectiveness of ONES with 64 GPUs on TACC's Longhorn supercomputers. The results show that…
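The abstract describes an evolutionary search over per-job batch sizes but does not spell out its details here. The following is only a minimal sketch of what such a search loop could look like, not the ONES algorithm itself. Every name and number in it is an assumption made for illustration: `PER_GPU_BATCH`, `BATCH_CHOICES`, the toy `fitness` (total samples per step, zero when the schedule needs more GPUs than exist), and the neighbor-step `mutate`. Only the 64-GPU cluster size comes from the abstract.

```python
import random

# Hypothetical constants, chosen for illustration only.
TOTAL_GPUS = 64        # cluster size, matching the evaluation scale in the abstract
PER_GPU_BATCH = 32     # assumed number of samples one GPU processes per step
BATCH_CHOICES = [32, 64, 128, 256, 512, 1024]

def gpus_needed(batch_size):
    # Ceiling division: GPUs required to hold one global batch.
    return -(-batch_size // PER_GPU_BATCH)

def fitness(candidate):
    # Toy objective (an assumption, not the paper's): total samples per step,
    # or zero if the candidate oversubscribes the cluster.
    used = sum(gpus_needed(b) for b in candidate)
    if used > TOTAL_GPUS:
        return 0
    return sum(candidate)

def mutate(candidate, rng):
    # Move one randomly chosen job to a neighboring batch-size choice.
    child = list(candidate)
    i = rng.randrange(len(child))
    pos = BATCH_CHOICES.index(child[i]) + rng.choice([-1, 1])
    pos = min(len(BATCH_CHOICES) - 1, max(0, pos))
    child[i] = BATCH_CHOICES[pos]
    return child

def evolve(num_jobs, population_size=16, generations=50, seed=0):
    rng = random.Random(seed)
    population = [
        [rng.choice(BATCH_CHOICES) for _ in range(num_jobs)]
        for _ in range(population_size)
    ]
    for _ in range(generations):
        children = [mutate(rng.choice(population), rng)
                    for _ in range(population_size)]
        # Elitist selection: keep the fittest of parents plus children.
        population = sorted(population + children,
                            key=fitness, reverse=True)[:population_size]
    return population[0]

best = evolve(num_jobs=4)
print(best, fitness(best))
```

An online scheduler would presumably rerun or continue such a loop as jobs arrive and finish, and would score candidates with measured job performance rather than this toy objective.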
Taxonomy
Topics: Cloud Computing and Resource Management · Parallel Computing and Optimization Techniques · IoT and Edge/Fog Computing
