Enhancing Performance Insight at Scale: A Heterogeneous Framework for Exascale Diagnostics
Dragana Grbic (Department of Computer Science, Rice University)

TL;DR
This paper introduces a high-performance, GPU-accelerated framework for large-scale performance diagnostics in exascale systems, enabling rapid analysis and localization of network issues.
Contribution
It presents a novel, scalable infrastructure with GPU acceleration and topology-aware diagnostics for exascale performance analysis, improving speed and insight.
Findings
Achieves 9.69-second ingestion for 100,000 MPI ranks on Aurora.
GPU layer speeds up analysis by up to 314x over CPU.
Identifies a 32.28% potential speedup for GAMESS on Frontier.
Abstract
As exascale systems reach unprecedented concurrency, traditional performance analysis tools struggle with the overhead of massive-scale telemetry. We present an accelerated infrastructure for the hpcanalysis framework that leverages a high-performance C++ API and GPU parallelism to enable high-throughput diagnostics. Our C++ API achieves a 9.69-second ingestion time for 100,000 MPI ranks on Aurora. Furthermore, our GPU-accelerated layer achieves up to 314x speedup over CPU-based processing when analyzing 100,000 execution traces. Finally, we implement a topology-aware workflow that maps logical performance outliers to physical Slingshot interconnect coordinates, localizing network congestion across 22 distinct racks on Aurora. We also demonstrate how the framework's advanced interface seamlessly integrates with external tools to provide sophisticated analytical models. We introduce a…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
