Sustaining Exascale Performance: Lessons from HPL and HPL-MxP on Aurora
Kazushige Goto, Huda Ibeid, Kalyan Kumaran, Servesh Muralidharan, Anthony-Trung Nguyen, Aditya Nishtala

TL;DR
This paper shares lessons from deploying HPL and HPL-MxP on Aurora, an exascale system with Intel GPUs and large-scale interconnects, highlighting system-level choices that sustain high performance.
Contribution
It reports real-world deployment experiences and identifies key system-level strategies that enable sustained exascale performance on heterogeneous architectures.
Findings
Aurora achieved 1.01EF/s in FP64 HPL on 9,234 nodes and 11.64EF/s with HPL-MxP, an 11.5x speedup over the FP64 result.
System-level choices such as deterministic locality-aware resource mapping, explicit CPU-GPU pipelining, and mixed-precision orchestration were key to sustaining these results.
Lessons from Aurora are applicable to other large-scale heterogeneous systems.
Abstract
Sustaining exascale performance in production requires engineering choices and operational practices that emerge only under real deployment constraints and demand coordination across system layers. This paper reports experience from three successive campaigns running HPL and HPL-MxP on Aurora, an Intel-based exascale system featuring the first large-scale deployment of Intel discrete GPUs, CPU-attached network interfaces, and the largest production Slingshot-11 interconnect. Aurora progressed from 0.585EF/s on 5,439 nodes to 1.01EF/s on 9,234 nodes in FP64 HPL, while HPL-MxP reached 11.64EF/s, an 11.5x speedup over FP64 enabled by mixed-precision arithmetic and Intel AMX acceleration. We identify and classify by role at production scale the system-level choices that sustained these results, including deterministic locality-aware resource mapping, explicit CPU-GPU pipelining,…
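The headline figures in the abstract can be cross-checked with a few lines of arithmetic. The sketch below is illustrative only: the variable and function names are hypothetical, the per-node values are simple averages (reported rate divided by node count) rather than measured per-node performance, and the speedup is a ratio of the two reported rates, since the node count of the HPL-MxP run is not given in the text above.

```python
# Figures quoted in the abstract; names here are illustrative, not from the paper.
EXA = 1e18   # FLOP/s in one EF/s
TERA = 1e12  # FLOP/s in one TF/s

hpl_first = {"rate_ef": 0.585, "nodes": 5439}  # earliest FP64 HPL campaign
hpl_final = {"rate_ef": 1.01, "nodes": 9234}   # final FP64 HPL campaign
mxp_rate_ef = 11.64                            # HPL-MxP result


def per_node_tflops(rate_ef: float, nodes: int) -> float:
    """Average FP64 throughput per node in TF/s (rate divided by node count)."""
    return rate_ef * EXA / nodes / TERA


# Ratio of the reported HPL-MxP rate to the reported FP64 HPL rate.
speedup = mxp_rate_ef / hpl_final["rate_ef"]

first_node = per_node_tflops(**hpl_first)
final_node = per_node_tflops(**hpl_final)

# Growth in machine size versus growth in delivered FP64 rate between campaigns.
node_growth = hpl_final["nodes"] / hpl_first["nodes"]
rate_growth = hpl_final["rate_ef"] / hpl_first["rate_ef"]

print(f"HPL-MxP / FP64 HPL speedup: {speedup:.2f}x")
print(f"Per-node FP64 average: {first_node:.1f} -> {final_node:.1f} TF/s")
print(f"Node growth {node_growth:.2f}x, rate growth {rate_growth:.2f}x")
```

Under these assumptions the ratio rounds to the 11.5x quoted in the abstract, and the delivered FP64 rate grew slightly faster than the node count, so the average per-node throughput did not drop as the run scaled from 5,439 to 9,234 nodes.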
