Characterizing Deep Learning Training Workloads on Alibaba-PAI

Mengdi Wang; Chen Meng; Guoping Long; Chuan Wu; Jun Yang; Wei Lin; Yangqing Jia

arXiv:1910.05930·cs.PF·October 15, 2019·5 cites

Open Access

TL;DR

This paper analyzes deep learning training workloads on Alibaba's PAI platform, identifying communication as the main bottleneck and exploring hardware/software optimizations for performance improvements.

Contribution

It provides a detailed characterization of training workloads, performance bottlenecks, and potential hardware/software optimizations for Alibaba's AI cloud.

Findings

01

Weight/gradient communication during training accounts for about 62% of total execution time on average across the workloads studied.

02

Computation is not the primary bottleneck.

03

Upgrading interconnects and architecture can significantly improve performance.
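
To make the link between the first and third findings concrete, here is a minimal back-of-the-envelope sketch. It is not the paper's analytical framework: it assumes an Amdahl-style model in which communication and computation do not overlap, treats the 62% average as if it applied to a single job, and uses hypothetical interconnect speedup factors. The function name `overall_speedup` and the constant `COMM_FRACTION` are illustrative.

```python
def overall_speedup(comm_fraction: float, comm_speedup: float) -> float:
    """Amdahl-style estimate of end-to-end speedup.

    comm_fraction: share of execution time spent on weight/gradient
                   communication (0.62 is the average reported in the summary).
    comm_speedup:  hypothetical factor by which communication gets faster,
                   e.g. from a faster interconnect. Assumed, not measured.
    """
    if not 0.0 <= comm_fraction <= 1.0:
        raise ValueError("comm_fraction must be in [0, 1]")
    if comm_speedup <= 0.0:
        raise ValueError("comm_speedup must be positive")
    # Non-communication time is unchanged; communication time shrinks.
    remaining = (1.0 - comm_fraction) + comm_fraction / comm_speedup
    return 1.0 / remaining


COMM_FRACTION = 0.62  # from the findings above

# Upper bound if communication became free (only the other 38% remains).
UPPER_BOUND = 1.0 / (1.0 - COMM_FRACTION)

if __name__ == "__main__":
    for k in (2.0, 4.0, 10.0):  # hypothetical interconnect speedups
        print(f"communication {k:>4.0f}x faster -> "
              f"overall {overall_speedup(COMM_FRACTION, k):.2f}x")
    print(f"upper bound with free communication: {UPPER_BOUND:.2f}x")
```

Under these assumptions the end-to-end gain is capped by the 38% of time that is not communication, so even an arbitrarily fast interconnect yields well under a 3x overall speedup. Real jobs may overlap communication with computation, which this sketch ignores.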

Abstract

Modern deep learning models have been exploited in various domains, including computer vision (CV), natural language processing (NLP), search and recommendation. In practical AI clusters, workloads training these models are run using software frameworks such as TensorFlow, Caffe, PyTorch and CNTK. One critical issue for efficiently operating practical AI clouds is to characterize the computing and data transfer demands of these workloads and, more importantly, the training performance given the underlying software framework and hardware configurations. In this paper, we characterize deep learning training workloads from the Platform of Artificial Intelligence (PAI) at Alibaba. We establish an analytical framework to investigate the detailed execution time breakdown of various workloads using different training architectures, in order to identify performance bottlenecks. Results show that weight/gradient…

Taxonomy

Topics: Advanced Neural Network Applications · Ferroelectric and Negative Capacitance Devices · Adversarial Robustness in Machine Learning