Automatic Performance Debugging of SPMD-style Parallel Programs
Xu Liu, Jianfeng Zhan, Kunlin Zhan, Weisong Shi, Lin Yuan, Dan Meng, and Lei Wang

TL;DR
AutoAnalyzer is a system that automatically detects, locates, and explains performance bottlenecks in SPMD-style parallel programs without a priori knowledge, using lightweight performance data together with new clustering and searching algorithms.
Contribution
The paper introduces new clustering and searching algorithms, and a root cause analysis method based on rough set theory, for automated performance debugging of parallel programs.
Findings
Bottleneck detection and root cause identification are demonstrated on real applications.
Lightweight data collection reduces overhead in performance analysis.
Automated debugging improves efficiency and accuracy in optimizing parallel programs.
Abstract
The single program, multiple data (SPMD) programming model is widely used for both high performance computing and cloud computing. In this paper, we design and implement an innovative system, AutoAnalyzer, that automates the process of debugging performance problems of SPMD-style parallel programs, including data collection, performance behavior analysis, locating bottlenecks, and uncovering their root causes. AutoAnalyzer is unique in terms of two features: first, without any a priori knowledge, it automatically locates bottlenecks and uncovers their root causes for performance optimization; second, it is lightweight in terms of the size of performance data to be collected and analyzed. Our contributions are three-fold: first, we propose two effective clustering algorithms to investigate the existence of performance bottlenecks that cause process behavior dissimilarity or code region…
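To make the idea of "process behavior dissimilarity" concrete, here is a minimal sketch. It is not the paper's clustering algorithm: it assumes each process is described by a vector of time spent per code region, and it uses a simple greedy distance-threshold grouping. The function names, the threshold, and the timing data are all hypothetical. In an SPMD program every process runs the same code, so more than one cluster hints that some process behaves differently and may point to a bottleneck.

```python
from math import sqrt


def distance(a, b):
    """Euclidean distance between two per-process timing vectors."""
    return sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cluster_processes(timings, threshold):
    """Greedy threshold clustering (illustrative, not the paper's method).

    Each process joins the first cluster whose representative (its first
    member) lies within `threshold`; otherwise it starts a new cluster.
    """
    clusters = []  # each cluster is a list of process ranks
    for rank, vec in enumerate(timings):
        for members in clusters:
            if distance(vec, timings[members[0]]) <= threshold:
                members.append(rank)
                break
        else:
            clusters.append([rank])
    return clusters


# Hypothetical data: one row per process, one column per code region (seconds).
timings = [
    [1.0, 2.0, 0.5],
    [1.1, 2.1, 0.5],
    [1.0, 1.9, 0.6],
    [1.0, 5.0, 0.5],  # rank 3 spends far longer in the second code region
]

clusters = cluster_processes(timings, threshold=0.5)
print(clusters)  # [[0, 1, 2], [3]]
```

Here rank 3 falls into its own cluster, which flags the second code region as a candidate for closer inspection. The threshold is an arbitrary choice for this toy data; a real tool would need a principled way to set it.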
Taxonomy
Topics: Cloud Computing and Resource Management · Software System Performance and Reliability · Distributed and Parallel Computing Systems
