Preparing for Performance Analysis at Exascale

Jonathon Anderson; Yumeng Liu; John Mellor-Crummey

arXiv:2108.04002 · cs.DC · March 11, 2022

TL;DR

This paper introduces a streaming aggregation method for analyzing large-scale, sparse performance data from exascale heterogeneous systems, significantly improving analysis speed and data compactness.

Contribution

It presents a novel parallel postmortem analysis approach that efficiently handles sparse, heterogeneous performance measurements at exascale, outperforming existing tools in speed and data size.

Findings

01. Analyzes large-scale GPU-accelerated applications faster than HPCToolkit.

02. Produces sparse performance profiles that are much smaller than dense representations.

03. Achieves over an order of magnitude speedup in performance analysis.

Abstract

Performance tools for emerging heterogeneous exascale platforms must address two principal challenges when analyzing execution measurements. First, measurement of large-scale executions may record mountains of performance data. Second, performance measurements for parallel programs are sparse in two ways: the set of metrics present for any context and the set of contexts present in different threads. For GPU-accelerated applications, an important source of sparsity is that none of the myriad of GPU metrics apply to any of the many CPU contexts. To address these challenges, we developed a novel streaming aggregation approach to postmortem analysis that employs both shared and distributed memory parallelism to aggregate sparse performance measurements from every rank, thread, and GPU stream of an application, and attributes heterogeneous call path profiles and traces to source code. Using…
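
The two kinds of sparsity described in the abstract can be made concrete with a toy example. The sketch below is only an illustration under stated assumptions, not the paper's implementation: the profile layout, the context and metric names, and the helpers `aggregate` and `dense_cell_count` are all hypothetical. Each per-thread profile stores only the (context, metric) pairs that were actually measured, and the profiles are folded one at a time into running totals, so no dense thread × context × metric table is ever built.

```python
from collections import defaultdict

# Hypothetical toy data: one dict per thread or GPU stream.
# Only measured (context, metric) pairs are stored, so GPU metrics never
# appear under CPU-only contexts and absent contexts cost nothing.
profiles = [
    {("main", "cpu_time"): 2.0, ("kernel_launch", "gpu_time"): 5.0},
    {("main", "cpu_time"): 3.0},
    {("kernel_launch", "gpu_time"): 4.0, ("kernel_launch", "gpu_mem_bytes"): 1024.0},
]


def aggregate(profile_stream):
    """Fold profiles one at a time into running sums and occurrence counts."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for profile in profile_stream:
        for key, value in profile.items():
            totals[key] += value
            counts[key] += 1
    return dict(totals), dict(counts)


def dense_cell_count(profile_list):
    """Cells a dense profile x context x metric table would need."""
    contexts = {context for p in profile_list for (context, _) in p}
    metrics = {metric for p in profile_list for (_, metric) in p}
    return len(profile_list) * len(contexts) * len(metrics)


# The aggregator only needs an iterator, so profiles could be read lazily.
totals, counts = aggregate(iter(profiles))
sparse_cells = sum(len(p) for p in profiles)
dense_cells = dense_cell_count(profiles)

print(f"sparse cells stored: {sparse_cells}, dense cells required: {dense_cells}")
for (context, metric), total in sorted(totals.items()):
    print(f"{context:>14} {metric:<14} sum={total} n={counts[(context, metric)]}")
```

Even in this three-profile toy, the sparse form stores 5 values where a dense table would need 18 cells. The sketch is sequential; the abstract describes an approach that additionally uses shared and distributed memory parallelism, which is not modeled here.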

Taxonomy

Topics: Cloud Computing and Resource Management · Software System Performance and Reliability · Parallel Computing and Optimization Techniques