Understanding Performance Problems in Deep Learning Systems
Junming Cao, Bihuan Chen, Chao Sun, Longjie Hu, Shuaihong Wu, Xin Peng

TL;DR
This paper analyzes performance problems in deep learning systems, characterizing their symptoms and root causes, and introduces Deep-Perf, a static checker that detects and helps fix these issues.
Contribution
It is the first study to systematically characterize performance problems in DL systems and evaluate existing analysis approaches, leading to the development of a practical static checker tool.
Findings
224 PPs collected from 210 StackOverflow posts
Deep-Perf detected 488 new PPs in GitHub projects
105 PPs confirmed and fixed
Abstract
Deep learning (DL) has been widely applied to many domains. Unique challenges in engineering DL systems are posed by the programming paradigm shift from traditional systems to DL systems, and performance is one of the challenges. Performance problems (PPs) in DL systems can cause severe consequences such as excessive resource consumption and financial loss. While bugs in DL systems have been extensively investigated, PPs in DL systems have hardly been explored. To bridge this gap, we present the first comprehensive study to i) characterize symptoms, root causes, and introducing and exposing stages of PPs in DL systems developed in TensorFlow and Keras, with 224 PPs collected from 210 StackOverflow posts, and to ii) assess the capability of existing performance analysis approaches in tackling PPs, with a constructed benchmark of 58 PPs in DL systems. Our findings shed light on the…
