Scaling the memory wall using mixed-precision -- HPG-MxP on an exascale machine

Aditya Kashi; Nicholson Koukpaizan; Hao Lu; Michael Matheson; Sarp Oral; Feiyi Wang

arXiv:2507.11512·cs.DC·July 16, 2025

TL;DR

This paper demonstrates a highly optimized implementation of the HPG-MxP benchmark on an exascale system, achieving a 1.6x speedup using mixed-precision algorithms for sparse matrix computations on GPU supercomputers.

Contribution

It presents a highly optimized implementation of the HPG-MxP benchmark for an exascale system, describes the accompanying algorithm enhancements, and reports a 1.6x speedup from mixed-precision techniques.

Findings

01. Achieved a 1.6x speedup with mixed precision on GPU supercomputers.

02. Optimized implementation of the HPG-MxP benchmark for exascale systems.

03. Demonstrated practical benefits of mixed precision in memory-bandwidth-limited applications.
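
To make the memory-bandwidth argument concrete, the following is a rough back-of-the-envelope sketch, not a model taken from the paper. It estimates the bytes moved by one sparse matrix-vector product when the values are stored in FP64 versus FP32. The CSR storage format, the 32-bit indices, the 27 nonzeros per row, and the helper name `csr_spmv_bytes` are all illustrative assumptions.

```python
def csr_spmv_bytes(n_rows, nnz_per_row, value_bytes, index_bytes=4):
    """Approximate bytes moved for one CSR product y = A x (hypothetical helper)."""
    nnz = n_rows * nnz_per_row
    matrix = nnz * (value_bytes + index_bytes)  # nonzero values plus column indices
    row_ptr = (n_rows + 1) * index_bytes        # CSR row offsets
    vectors = 2 * n_rows * value_bytes          # read x once (ideal caching), write y
    return matrix + row_ptr + vectors


# Assumed problem shape: one million rows, 27 nonzeros per row.
n_rows = 1_000_000
nnz_per_row = 27

fp64_bytes = csr_spmv_bytes(n_rows, nnz_per_row, value_bytes=8)
fp32_bytes = csr_spmv_bytes(n_rows, nnz_per_row, value_bytes=4)
ratio = fp64_bytes / fp32_bytes

print(f"FP64 traffic: {fp64_bytes / 1e6:.1f} MB")
print(f"FP32 traffic: {fp32_bytes / 1e6:.1f} MB")
print(f"Traffic ratio FP64/FP32: {ratio:.2f}")
```

Under these assumptions the traffic ratio stays well below 2x, because the integer index data does not shrink when the floating-point values do. This is one simple reason a bandwidth-bound kernel cannot be expected to double in speed just by halving the value precision.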

Abstract

Mixed-precision algorithms have been proposed as a way for scientific computing to benefit from some of the gains seen for artificial intelligence (AI) on recent high performance computing (HPC) platforms. A few applications dominated by dense matrix operations have seen substantial speedups by utilizing low-precision formats such as FP16. However, a majority of scientific simulation applications are memory-bandwidth limited. Beyond preliminary studies, the practical gain from using mixed-precision algorithms on a given HPC system is largely unclear. The High Performance GMRES Mixed Precision (HPG-MxP) benchmark has been proposed to measure the useful performance of an HPC system on sparse matrix-based mixed-precision applications. In this work, we present a highly optimized implementation of the HPG-MxP benchmark for an exascale system and describe our algorithm enhancements. We show…
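
As an illustration of what a mixed-precision GMRES solver can look like, here is a minimal NumPy sketch of iterative refinement: the residual and the solution update are kept in FP64, while each inner GMRES cycle runs entirely in FP32. This is a sketch under stated assumptions, not the benchmark's implementation. The FP64/FP32 split, the small dense test matrix (the benchmark targets sparse matrices), and the names `gmres_cycle` and `mixed_precision_solve` are all hypothetical choices made for brevity.

```python
import numpy as np


def gmres_cycle(A, r, m):
    """One GMRES(m) cycle for A d = r from a zero guess, in A's precision."""
    dtype = A.dtype
    n = r.shape[0]
    beta = np.linalg.norm(r)
    Q = np.zeros((n, m + 1), dtype=dtype)  # orthonormal Krylov basis
    H = np.zeros((m + 1, m), dtype=dtype)  # upper Hessenberg matrix
    Q[:, 0] = r / beta
    k = m
    for j in range(m):
        w = A @ Q[:, j]
        for i in range(j + 1):             # modified Gram-Schmidt
            H[i, j] = Q[:, i] @ w
            w = w - H[i, j] * Q[:, i]
        H[j + 1, j] = np.linalg.norm(w)
        if H[j + 1, j] < np.finfo(dtype).eps:  # Krylov space is exhausted
            k = j + 1
            break
        Q[:, j + 1] = w / H[j + 1, j]
    e1 = np.zeros(k + 1, dtype=dtype)
    e1[0] = beta
    # Small least-squares problem: minimize || beta * e1 - H y ||.
    y, *_ = np.linalg.lstsq(H[:k + 1, :k], e1, rcond=None)
    return Q[:, :k] @ y


def mixed_precision_solve(A, b, m=30, tol=1e-10, max_outer=50):
    """Iterative refinement: FP64 residuals, FP32 inner GMRES cycles."""
    A_lo = A.astype(np.float32)            # low-precision copy of the matrix
    x = np.zeros_like(b)
    b_norm = np.linalg.norm(b)
    for _ in range(max_outer):
        r = b - A @ x                      # residual computed in FP64
        if np.linalg.norm(r) <= tol * b_norm:
            break
        d = gmres_cycle(A_lo, r.astype(np.float32), m)
        x = x + d.astype(np.float64)       # correction applied in FP64
    rel_res = np.linalg.norm(b - A @ x) / b_norm
    return x, rel_res


# Assumed test problem: a well-conditioned tridiagonal matrix stored densely.
n = 200
A = 4.0 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)
b = np.random.default_rng(0).standard_normal(n)

x, rel_res = mixed_precision_solve(A, b)
print(f"relative residual: {rel_res:.2e}")
```

The point of the design is that the expensive, bandwidth-heavy matrix-vector products inside the inner cycle touch only FP32 data, while the occasional FP64 residual keeps the final answer at double-precision accuracy. Whether this pays off on a given machine is the kind of question a benchmark such as HPG-MxP is meant to answer.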

Taxonomy

Topics: Neural Networks and Applications · Advanced Algorithms and Applications · Parallel Computing and Optimization Techniques