TL;DR
This paper analyzes the characteristics of deep learning workloads in large GPU datacenters and proposes a data-driven resource management framework that improves scheduling efficiency and cluster utilization.
Contribution
It provides a large-scale analysis of DL job traces and introduces a framework for resource management based on historical data, with practical scheduling and energy-saving services.
Findings
Quasi-Shortest-Service-First scheduling reduces cluster-wide average job completion time by up to 6.5x
The Cluster Energy Saving service improves overall cluster utilization by up to 13%
Insights into user and job behaviors inform cluster system design
Abstract
Modern GPU datacenters are critical for delivering Deep Learning (DL) models and services in both the research community and industry. When operating a datacenter, optimization of resource scheduling and management can bring significant financial benefits. Achieving this goal requires a deep understanding of the job features and user behaviors. We present a comprehensive study about the characteristics of DL jobs and resource management. First, we perform a large-scale analysis of real-world job traces from SenseTime. We uncover some interesting conclusions from the perspectives of clusters, jobs and users, which can facilitate the cluster system designs. Second, we introduce a general-purpose framework, which manages resources based on historical data. As case studies, we design: a Quasi-Shortest-Service-First scheduling service, which can minimize the cluster-wide average job…
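To make the idea of scheduling from historical data concrete, here is a minimal Python sketch of a shortest-service-first ordering. It is an illustration under stated assumptions, not the paper's implementation: it assumes that a job's "service" can be approximated as the requested GPU count multiplied by a predicted duration, and that the duration can be estimated from the mean of the same user's past job durations. The names `Job`, `predict_duration`, `qssf_order`, and the one-hour default for users with no history are all hypothetical.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class Job:
    job_id: str
    user: str
    gpus: int  # number of GPUs requested


def predict_duration(history, user, default=3600.0):
    """Estimate a job's duration (seconds) from the user's past jobs.

    Assumption: the mean of past durations is a good enough predictor.
    Users with no history fall back to a hypothetical default of one hour.
    """
    past = history.get(user)
    return mean(past) if past else default


def qssf_order(queue, history):
    """Order queued jobs by predicted service (GPUs x predicted duration).

    Jobs expected to consume the least GPU time are scheduled first.
    """
    return sorted(queue, key=lambda j: j.gpus * predict_duration(history, j.user))


# Hypothetical historical durations per user, in seconds.
history = {"alice": [600.0, 1200.0], "bob": [7200.0]}
queue = [Job("j1", "bob", 8), Job("j2", "alice", 1), Job("j3", "carol", 2)]

order = [j.job_id for j in qssf_order(queue, history)]
print(order)
```

The ordering is "quasi" shortest-first because it relies on a prediction rather than the true remaining service time; a better predictor can be swapped in without changing the ordering logic.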
