Characterizing Deep Learning Training Workloads on Alibaba-PAI
Mengdi Wang, Chen Meng, Guoping Long, Chuan Wu, Jun Yang, Wei Lin, Yangqing Jia

TL;DR
This paper analyzes deep learning training workloads on Alibaba's PAI platform, identifying communication as the main bottleneck and exploring hardware/software optimizations for performance improvements.
Contribution
It provides a detailed characterization of training workloads, performance bottlenecks, and potential hardware/software optimizations for Alibaba's AI cloud.
Findings
Weight/gradient communication accounts for almost 62% of total execution time, averaged across the workloads studied.
Computation is not the primary bottleneck.
Upgrading interconnects and architecture can significantly improve performance.
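To make the findings above concrete, here is a minimal sketch of an execution time breakdown. It is not the paper's analytical framework: the names `Workload`, `Hardware`, and `time_breakdown` are hypothetical, all numbers are invented for illustration, and the sketch assumes computation, memory access, and communication do not overlap, so their times simply add up.

```python
from dataclasses import dataclass, replace


@dataclass
class Workload:
    flops: float       # floating-point operations per training step (hypothetical)
    mem_bytes: float   # bytes moved to/from GPU memory per step (hypothetical)
    comm_bytes: float  # weight/gradient bytes exchanged per step (hypothetical)


@dataclass
class Hardware:
    peak_flops: float  # sustained compute rate, FLOP/s
    mem_bw: float      # GPU memory bandwidth, bytes/s
    net_bw: float      # interconnect bandwidth, bytes/s


def time_breakdown(w: Workload, hw: Hardware) -> dict:
    """Return each component's share of step time, assuming no overlap."""
    seconds = {
        "compute": w.flops / hw.peak_flops,
        "memory": w.mem_bytes / hw.mem_bw,
        "communication": w.comm_bytes / hw.net_bw,
    }
    total = sum(seconds.values())
    return {name: t / total for name, t in seconds.items()}


# Illustrative values only; they are not measurements from the paper.
job = Workload(flops=2e12, mem_bytes=1e11, comm_bytes=1e9)
slow_net = Hardware(peak_flops=1e13, mem_bw=5e11, net_bw=25e9 / 8)  # 25 Gbps link
fast_net = replace(slow_net, net_bw=100e9 / 8)                      # 100 Gbps link

for label, hw in [("25 Gbps", slow_net), ("100 Gbps", fast_net)]:
    shares = time_breakdown(job, hw)
    print(label, {name: round(share, 3) for name, share in shares.items()})
```

Under these assumptions, raising the interconnect bandwidth shrinks only the communication term, which is why a communication-bound workload benefits most from a faster interconnect.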
Abstract
Modern deep learning models have been exploited in various domains, including computer vision (CV), natural language processing (NLP), search and recommendation. In practical AI clusters, workloads training these models are run using software frameworks such as TensorFlow, Caffe, PyTorch and CNTK. One critical issue for efficiently operating practical AI clouds is to characterize the computing and data transfer demands of these workloads and, more importantly, the training performance given the underlying software framework and hardware configurations. In this paper, we characterize deep learning training workloads from Platform of Artificial Intelligence (PAI) in Alibaba. We establish an analytical framework to investigate the detailed execution time breakdown of various workloads using different training architectures, in order to identify performance bottlenecks. Results show that weight/gradient…
Taxonomy
Topics: Advanced Neural Network Applications · Ferroelectric and Negative Capacitance Devices · Adversarial Robustness in Machine Learning
