Characterization and Prediction of Deep Learning Workloads in Large-Scale GPU Datacenters

Qinghao Hu; Peng Sun; Shengen Yan; Yonggang Wen; Tianwei Zhang

arXiv:2109.01313 · cs.DC · September 7, 2021

TL;DR

This paper characterizes deep learning workloads in large GPU datacenters using real-world job traces from SenseTime, and proposes a framework that manages resources based on historical data to shorten job completion time and save energy.

Contribution

It provides a large-scale analysis of DL job traces and introduces a framework for resource management based on historical data, with practical scheduling and energy-saving services.

Findings

01

A Quasi-Shortest-Service-First scheduling service reduces the cluster-wide average job completion time by up to 6.5x

02

A cluster energy-saving service improves cluster utilization by up to 13%

03

Insights into user and job behaviors inform cluster system design

Abstract

Modern GPU datacenters are critical for delivering Deep Learning (DL) models and services in both the research community and industry. When operating a datacenter, optimization of resource scheduling and management can bring significant financial benefits. Achieving this goal requires a deep understanding of the job features and user behaviors. We present a comprehensive study about the characteristics of DL jobs and resource management. First, we perform a large-scale analysis of real-world job traces from SenseTime. We uncover some interesting conclusions from the perspectives of clusters, jobs and users, which can facilitate the cluster system designs. Second, we introduce a general-purpose framework, which manages resources based on historical data. As case studies, we design: a Quasi-Shortest-Service-First scheduling service, which can minimize the cluster-wide average job…
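
To make the shortest-service-first idea concrete, here is a minimal sketch in Python. It is not the authors' implementation. It assumes a job's "service" can be approximated as GPU count times a predicted duration, that the prediction comes from some model trained on historical traces (here it is simply a given number), and that jobs run one at a time. The names `Job`, `predicted_service`, `qssf_order`, and `average_completion_time` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Job:
    name: str
    num_gpus: int
    predicted_hours: float  # assumed to come from a history-based predictor


def predicted_service(job: Job) -> float:
    """Assumed priority metric: predicted GPU time (GPUs x hours)."""
    return job.num_gpus * job.predicted_hours


def qssf_order(jobs: list[Job]) -> list[Job]:
    """Order jobs so that the smallest predicted service runs first."""
    return sorted(jobs, key=predicted_service)


def average_completion_time(jobs: list[Job]) -> float:
    """Average completion time when jobs run serially in the given order.

    This is a deliberately simplified model: one job at a time, and the
    predicted duration is treated as the actual duration.
    """
    clock = 0.0
    total = 0.0
    for job in jobs:
        clock += job.predicted_hours
        total += clock
    return total / len(jobs)


# Arrival (FIFO) order: one large job followed by two small ones.
arrivals = [
    Job("A", num_gpus=8, predicted_hours=10.0),
    Job("B", num_gpus=1, predicted_hours=1.0),
    Job("C", num_gpus=2, predicted_hours=2.0),
]

fifo_avg = average_completion_time(arrivals)
qssf_avg = average_completion_time(qssf_order(arrivals))
print(f"FIFO average completion: {fifo_avg:.2f} h")
print(f"QSSF average completion: {qssf_avg:.2f} h")
```

In this toy example the small jobs no longer wait behind the large one, which is why the average drops. The "quasi" part reflects that the ordering depends on predicted rather than known durations, so its quality depends on the predictor.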

Code & Models

Repositories

s-lab-system-group/heliosdata
Official
