Enhancing Performance Insight at Scale: A Heterogeneous Framework for Exascale Diagnostics

Dragana Grbic (Department of Computer Science, Rice University)

arXiv:2605.03561 · cs.DC · May 12, 2026

TL;DR

This paper presents a GPU-accelerated infrastructure for the hpcanalysis framework that enables fast, large-scale performance diagnostics on exascale systems, including localization of network congestion.

Contribution

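It presents a scalable infrastructure for exascale performance analysis that combines a high-performance C++ API, GPU-accelerated trace analysis, and a topology-aware workflow that maps logical performance outliers to physical Slingshot interconnect coordinates.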

Findings

01

Achieves a 9.69-second ingestion time for 100,000 MPI ranks on Aurora through the C++ API.

02

GPU-accelerated layer delivers up to 314x speedup over CPU-based processing when analyzing 100,000 execution traces.

03

Identifies a 32.28% potential speedup for GAMESS on Frontier.

Abstract

As exascale systems reach unprecedented concurrency, traditional performance analysis tools struggle with the overhead of massive-scale telemetry. We present an accelerated infrastructure for the hpcanalysis framework that leverages a high-performance C++ API and GPU parallelism to enable high-throughput diagnostics. Our C++ API achieves a 9.69-second ingestion time for 100,000 MPI ranks on Aurora. Furthermore, our GPU-accelerated layer achieves up to 314x speedup over CPU-based processing when analyzing 100,000 execution traces. Finally, we implement a topology-aware workflow that maps logical performance outliers to physical Slingshot interconnect coordinates, localizing network congestion across 22 distinct racks on Aurora. We also demonstrate how the framework's advanced interface seamlessly integrates with external tools to provide sophisticated analytical models. We introduce a…
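
The topology-aware step described in the abstract, mapping logical outliers to physical locations, can be illustrated with a minimal Python sketch. This is not the hpcanalysis API and not the paper's method. The function names, the per-rank wait-time metric, the MAD-based outlier rule, and the rank-to-rack table are all assumptions made for illustration.

```python
from collections import Counter
from statistics import median


def find_outlier_ranks(metric_by_rank, threshold=3.5):
    """Return ranks whose metric is unusually high.

    Uses a modified z-score based on the median absolute deviation (MAD).
    This rule is an assumption for illustration, not the paper's criterion.
    """
    values = list(metric_by_rank.values())
    med = median(values)
    mad = median(abs(v - med) for v in values)
    if mad == 0:
        return []  # no spread in the data, so nothing can be flagged
    return sorted(
        rank
        for rank, value in metric_by_rank.items()
        if 0.6745 * (value - med) / mad > threshold
    )


def count_outliers_per_rack(outlier_ranks, rank_to_rack):
    """Group outlier ranks by the physical rack that hosts them."""
    return Counter(rank_to_rack[rank] for rank in outlier_ranks)


# Hypothetical per-rank communication wait times, in seconds.
wait_times = {0: 1.0, 1: 1.1, 2: 0.9, 3: 1.0, 4: 5.0, 5: 1.2, 6: 4.8, 7: 1.0}

# Hypothetical placement of ranks on racks. A real workflow would derive
# this from the system's topology information rather than a literal table.
rank_to_rack = {
    0: "rack-A", 1: "rack-A", 2: "rack-A", 3: "rack-A",
    4: "rack-B", 5: "rack-B", 6: "rack-C", 7: "rack-C",
}

outliers = find_outlier_ranks(wait_times)
racks = count_outliers_per_rack(outliers, rank_to_rack)
```

With this toy data, ranks 4 and 6 are flagged and land on two different racks. The point of the grouping is that a cluster of outliers on one rack suggests a local hardware or congestion problem, while a spread across many racks suggests a wider issue.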
