Optimizing High-Performance Linpack for Exascale Accelerated Architectures

Noel Chalmers; Jakub Kurzak; Damon McDougall; Paul T. Bauman

arXiv:2304.10397·cs.DC·April 21, 2023·1 cites

Open Access

TL;DR

This paper presents rocHPL, an optimized implementation of the HPL benchmark for exascale architectures, utilizing GPU accelerators and CPU optimizations to improve performance on systems like the Frontier supercomputer.

Contribution

The paper introduces novel optimization techniques for HPL tailored to exascale accelerated architectures, including multi-threaded panel factorization on the CPU, time-sharing of CPU cores between processes on a node, and several optimizations that hide MPI communication.

Findings

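01 Reports performance results on a single node of the Frontier early access cluster at Oak Ridge National Laboratory.

02 Demonstrates scaling across multiple nodes.

03 Splits the work between the GPU accelerators, driven through optimized linear algebra libraries, and the CPU socket, which handles the latency-sensitive factorization phases.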

Abstract

We detail the performance optimizations made in rocHPL, AMD's open-source implementation of the High-Performance Linpack (HPL) benchmark targeting accelerated node architectures designed for exascale systems such as the Frontier supercomputer. The implementation leverages the high-throughput GPU accelerators on the node via highly optimized linear algebra libraries, as well as the entire CPU socket to perform latency-sensitive factorization phases. We detail novel performance improvements such as a multi-threaded approach to computing the panel factorization phase on the CPU, time-sharing of CPU cores between processes on the node, as well as several optimizations which hide MPI communication. We present some performance results of this implementation of the HPL benchmark on a single node of the Frontier early access cluster at Oak Ridge National Laboratory, as well as scaling to…
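
For readers unfamiliar with the term, the panel factorization phase named in the abstract is, in HPL, an LU factorization with row partial pivoting applied to a tall, narrow block of columns. The following is a minimal single-threaded sketch in plain Python, included only to illustrate that step. The function name `factor_panel` and the list-of-rows layout are assumptions made for this example; it is not rocHPL's multi-threaded implementation and does not reflect its data layout.

```python
def factor_panel(panel):
    """LU-factor a tall m x nb panel in place with row partial pivoting.

    `panel` is a list of m rows, each a list of nb floats (m >= nb).
    Returns the pivot row chosen at each column step. On return, the
    entries below the diagonal hold the multipliers (unit-lower L) and
    the top nb x nb upper triangle holds U.
    Illustrative sketch only; not taken from rocHPL.
    """
    m = len(panel)
    nb = len(panel[0])
    pivots = []
    for j in range(nb):
        # Pivot search: largest-magnitude entry in column j, rows j..m-1.
        p = max(range(j, m), key=lambda i: abs(panel[i][j]))
        if panel[p][j] == 0.0:
            raise ZeroDivisionError("panel is singular at column %d" % j)
        pivots.append(p)
        # Swap the pivot row into position j.
        panel[j], panel[p] = panel[p], panel[j]
        pivot = panel[j][j]
        for i in range(j + 1, m):
            # Scale the sub-diagonal entry to form the multiplier.
            panel[i][j] /= pivot
            # Rank-1 update of the remaining columns of row i.
            for k in range(j + 1, nb):
                panel[i][k] -= panel[i][j] * panel[j][k]
    return pivots


# Usage: factor a 3 x 2 panel.
example = [[1.0, 2.0], [4.0, 3.0], [2.0, 1.0]]
example_pivots = factor_panel(example)
```

Each column step contains a pivot search and a row swap that depend on the previous step, which is why the abstract describes this phase as latency-sensitive and assigns it to the CPU.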

Taxonomy

Topics: Interconnection Networks and Systems · Parallel Computing and Optimization Techniques · Distributed and Parallel Computing Systems