Synergy: Resource Sensitive DNN Scheduling in Multi-Tenant Clusters
Jayashree Mohan, Amar Phanishayee, Janardhan Kulkarni, Vijay Chidambaram

TL;DR
Synergy is a resource-sensitive scheduler for shared GPU clusters that allocates CPU and memory alongside GPUs according to each job's sensitivity to those resources, improving job completion times over schedulers that allocate them in proportion to GPU count.
Contribution
It introduces a novel profiling method and a near-optimal online algorithm for multi-resource-aware scheduling of DNN training jobs in multi-tenant clusters.
Findings
Up to 3.4x reduction in average job completion time.
Effective multi-resource workload-aware scheduling.
Improved resource utilization and job performance.
Abstract
Training Deep Neural Networks (DNNs) is a widely popular workload in both enterprises and cloud data centers. Existing schedulers for DNN training consider GPU as the dominant resource, and allocate other resources such as CPU and memory proportional to the number of GPUs requested by the job. Unfortunately, these schedulers do not consider the impact of a job's sensitivity to allocation of CPU, memory, and storage resources. In this work, we propose Synergy, a resource-sensitive scheduler for shared GPU clusters. Synergy infers the sensitivity of DNNs to different resources using optimistic profiling; some jobs might benefit from more than the GPU-proportional allocation and some jobs might not be affected by less than GPU-proportional allocation. Synergy performs such multi-resource workload-aware assignments across a set of jobs scheduled on shared multi-tenant clusters using a new…
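The contrast between GPU-proportional allocation and sensitivity-aware allocation described above can be illustrated with a small sketch. This is not Synergy's algorithm or its profiling method; it is a toy greedy redistribution under stated assumptions. The names `Server`, `Job`, `cpu_demand`, `gpu_proportional`, and `sensitivity_aware_cpus` are hypothetical, and `cpu_demand` stands in for whatever a profiler would report as the CPU count beyond which a job's throughput stops improving.

```python
from dataclasses import dataclass


@dataclass
class Server:
    gpus: int
    cpus: int
    mem_gb: float


@dataclass
class Job:
    name: str
    gpus: int
    cpu_demand: float  # hypothetical profiled value: CPUs at which throughput saturates


def gpu_proportional(job: Job, server: Server) -> tuple[float, float]:
    """Baseline: CPU and memory shares proportional to the GPUs requested."""
    frac = job.gpus / server.gpus
    return server.cpus * frac, server.mem_gb * frac


def sensitivity_aware_cpus(jobs: list[Job], server: Server) -> dict[str, float]:
    """Toy redistribution (an illustration, not the paper's method).

    Jobs that need less than their GPU-proportional CPU share release the
    surplus; jobs that need more draw from that surplus in list order.
    """
    alloc: dict[str, float] = {}
    spare = 0.0
    # Pass 1: cap every job at its proportional share and pool what is unused.
    for job in jobs:
        share, _ = gpu_proportional(job, server)
        alloc[job.name] = min(job.cpu_demand, share)
        spare += share - alloc[job.name]
    # Pass 2: hand the pooled CPUs to jobs whose demand exceeds their share.
    for job in jobs:
        extra = min(job.cpu_demand - alloc[job.name], spare)
        if extra > 0:
            alloc[job.name] += extra
            spare -= extra
    return alloc


if __name__ == "__main__":
    server = Server(gpus=8, cpus=24, mem_gb=500.0)
    jobs = [
        Job("cpu_hungry", gpus=4, cpu_demand=18),  # benefits from more than its share
        Job("cpu_light", gpus=4, cpu_demand=6),    # unaffected by less than its share
    ]
    print(sensitivity_aware_cpus(jobs, server))
```

With the example inputs, the proportional baseline gives each 4-GPU job 12 of the 24 CPUs; the sketch instead assigns 18 CPUs to `cpu_hungry` and 6 to `cpu_light`, without exceeding the server's total. A real scheduler would also have to handle memory, multiple servers, and fairness, none of which this sketch attempts.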
Taxonomy
Topics: Advanced Neural Network Applications · Stochastic Gradient Optimization Techniques · IoT and Edge/Fog Computing
