Exploring Fully Offloaded GPU Stream-Aware Message Passing

Naveen Namashivayam; Krishna Kandalla; James B White III; Larry Kaplan; Mark Pagel

arXiv:2306.15773·cs.DC·June 29, 2023

Open Access

TL;DR

This paper introduces a stream-triggered offload strategy for GPU-aware message passing that moves synchronization and data movement from the CPU to the GPU, yielding on-node performance gains of 36% over standard MPI active RMA and 61% over point-to-point communication.

Contribution

The paper proposes stream-triggered (ST) communication, an offload-friendly strategy that moves synchronization and data movement operations from the CPU to the GPU, and illustrates it with an implementation based on MPI one-sided active target synchronization.

Findings

01

On-node performance improved by 36% over standard MPI active RMA.

02

On-node performance improved by 61% over standard MPI point-to-point communication.

03

Multi-node performance is 23% better than active RMA but slightly slower than point-to-point.

Abstract

Modern heterogeneous supercomputing systems are comprised of CPUs, GPUs, and high-speed network interconnects. Communication libraries supporting efficient data transfers involving memory buffers from the GPU memory typically require the CPU to orchestrate the data transfer operations. A new offload-friendly communication strategy, stream-triggered (ST) communication, was explored to allow offloading the synchronization and data movement operations from the CPU to the GPU. A Message Passing Interface (MPI) one-sided active target synchronization based implementation was used as an exemplar to illustrate the proposed strategy. A latency-sensitive nearest neighbor microbenchmark was used to explore the various performance aspects of the implementation. The offloaded implementation shows significant on-node performance advantages over standard MPI active RMA (36%) and point-to-point (61%)…
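
The ordering idea behind stream-triggered communication can be illustrated with a toy model. The sketch below is not the paper's implementation and uses no real MPI or GPU API: `SimulatedStream`, `stream_triggered_exchange`, and the operation names are hypothetical, and the "GPU stream" is assumed to behave as a simple FIFO queue. It shows only that the CPU enqueues the kernel, the data transfer, and the completion wait up front, and that the stream later executes them in order without the CPU stepping in between.

```python
from collections import deque


class SimulatedStream:
    """Toy stand-in for a GPU stream: a FIFO of deferred operations (hypothetical)."""

    def __init__(self):
        self.pending = deque()   # operations enqueued by the CPU, not yet executed
        self.executed = []       # names of operations, in the order they actually ran

    def enqueue(self, name, operation):
        # The CPU only records the operation here; nothing executes yet.
        self.pending.append((name, operation))

    def drain(self):
        # Models the GPU side executing the enqueued operations in FIFO order.
        while self.pending:
            name, operation = self.pending.popleft()
            operation()
            self.executed.append(name)


def stream_triggered_exchange(send_buf, recv_buf):
    """Enqueue compute, transfer, and wait on one simulated stream, then drain it."""
    stream = SimulatedStream()

    def compute_kernel():
        # Stand-in for a GPU kernel that produces the data to send.
        for i in range(len(send_buf)):
            send_buf[i] *= 2

    def triggered_put():
        # Stand-in for a data transfer that fires once the kernel has finished.
        recv_buf[:] = send_buf

    def completion_wait():
        # Stand-in for a stream-side wait on transfer completion (no-op here).
        pass

    # The CPU enqueues all three operations without synchronizing in between.
    stream.enqueue("kernel", compute_kernel)
    stream.enqueue("triggered_put", triggered_put)
    stream.enqueue("completion_wait", completion_wait)

    deferred_count = len(stream.pending)  # everything is still pending at this point
    stream.drain()
    return stream.executed, deferred_count


send = [1, 2, 3]
recv = [0, 0, 0]
order, deferred = stream_triggered_exchange(send, recv)
```

In a CPU-orchestrated design, the CPU would instead wait for the kernel to finish before issuing the transfer. The toy model keeps that dependency but expresses it as queue order, which is the property the abstract describes as offloading synchronization to the GPU.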

Taxonomy

Topics: Parallel Computing and Optimization Techniques · Interconnection Networks and Systems · Distributed and Parallel Computing Systems