ScalAna: Automating Scaling Loss Detection with Graph Analysis
Yuyang Jin, Haojie Wang, Teng Yu, Xiongchao Tang, Torsten Hoefler, Xu Liu, Jidong Zhai

TL;DR
ScalAna is a static-analysis-based tool that detects the root causes of scaling bottlenecks in parallel programs with minimal overhead, helping improve their performance on supercomputers.
Contribution
It combines static compiler analysis with lightweight runtime data collection to enable detailed root-cause analysis of scaling issues at low overhead.
Findings
Effectively locates scaling bottlenecks in real applications.
Incurs only 1.73% overhead on average.
Achieves up to 11.11% performance improvement.
Abstract
Scaling a parallel program to modern supercomputers is challenging due to inter-process communication, Amdahl's law, and resource contention. Performance analysis tools for finding such scaling bottlenecks are based on either profiling or tracing. Profiling incurs low overhead but does not capture the detailed dependencies needed for root-cause analysis. Tracing collects all information, but at prohibitive overhead. In this work, we design ScalAna, which uses static analysis techniques to achieve the best of both worlds - it enables the analyzability of traces at a cost similar to profiling. ScalAna first leverages static compiler techniques to build a Program Structure Graph, which records the main computation and communication patterns as well as the program's control structures. At runtime, we adopt lightweight techniques to collect performance data according to the graph structure and generate a…
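To make the idea of a Program Structure Graph more concrete, here is a minimal Python sketch. It is not ScalAna's implementation: the `PSGNode` class, its fields, the vertex kinds, and the `most_imbalanced` heuristic are all hypothetical names and choices made for illustration. The sketch only shows the general shape described in the abstract: a static structure with vertices for control structures, computation, and communication, onto which per-process timings gathered at runtime are attached.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterator, List, Optional


@dataclass
class PSGNode:
    """One vertex of a toy program structure graph (hypothetical layout)."""
    kind: str   # assumed vertex kinds: "function", "loop", "comp", "comm"
    label: str  # e.g. a function name or a communication call name
    children: List["PSGNode"] = field(default_factory=list)
    # Runtime data attached to the static vertex: process rank -> seconds.
    time_by_rank: Dict[int, float] = field(default_factory=dict)

    def add(self, child: "PSGNode") -> "PSGNode":
        """Attach a child vertex and return it so calls can be chained."""
        self.children.append(child)
        return child

    def record(self, rank: int, seconds: float) -> None:
        """Accumulate measured time for one process on this vertex."""
        self.time_by_rank[rank] = self.time_by_rank.get(rank, 0.0) + seconds


def walk(node: PSGNode) -> Iterator[PSGNode]:
    """Yield every vertex in depth-first, pre-order."""
    yield node
    for child in node.children:
        yield from walk(child)


def total_time_by_kind(root: PSGNode) -> Dict[str, float]:
    """Sum recorded time over all ranks, grouped by vertex kind."""
    totals: Dict[str, float] = {}
    for node in walk(root):
        spent = sum(node.time_by_rank.values())
        if spent:
            totals[node.kind] = totals.get(node.kind, 0.0) + spent
    return totals


def most_imbalanced(root: PSGNode) -> Optional[PSGNode]:
    """Return the vertex with the largest max-min time spread across ranks.

    This is an illustrative heuristic only, not ScalAna's detection method.
    """
    best, best_spread = None, 0.0
    for node in walk(root):
        if len(node.time_by_rank) < 2:
            continue
        spread = max(node.time_by_rank.values()) - min(node.time_by_rank.values())
        if spread > best_spread:
            best, best_spread = node, spread
    return best


# Static part: the structure a compiler pass might extract (made-up program).
root = PSGNode("function", "main")
loop = root.add(PSGNode("loop", "time_step_loop"))
comp = loop.add(PSGNode("comp", "stencil_update"))
comm = loop.add(PSGNode("comm", "MPI_Allreduce"))

# Runtime part: made-up timings for two processes, attached to the vertices.
comp.record(0, 1.0)
comp.record(1, 1.25)
comm.record(0, 0.75)
comm.record(1, 0.25)

totals = total_time_by_kind(root)
suspect = most_imbalanced(root)
```

In this toy data, the communication vertex shows the largest spread between the two ranks, so it is the one a user would look at first. The timings are invented and carry no relation to the paper's measurements.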
