Exploring Fully Offloaded GPU Stream-Aware Message Passing
Naveen Namashivayam, Krishna Kandalla, James B White III, Larry Kaplan, Mark Pagel

TL;DR
This paper introduces a stream-triggered offload strategy for GPU-aware message passing that reduces CPU involvement, leading to significant on-node performance gains in heterogeneous supercomputing systems.
Contribution
The paper proposes a novel offload-friendly communication strategy, stream-triggered (ST) communication, which offloads synchronization and data movement operations from the CPU to the GPU. It illustrates the strategy with an MPI one-sided active target synchronization based implementation.
Findings
On-node performance improved by 36% over standard MPI active RMA.
On-node performance improved by 61% over standard MPI point-to-point communication.
Multi-node performance is 23% better than active RMA but slightly slower than point-to-point.
Abstract
Modern heterogeneous supercomputing systems are comprised of CPUs, GPUs, and high-speed network interconnects. Communication libraries supporting efficient data transfers involving memory buffers from the GPU memory typically require the CPU to orchestrate the data transfer operations. A new offload-friendly communication strategy, stream-triggered (ST) communication, was explored to allow offloading the synchronization and data movement operations from the CPU to the GPU. A Message Passing Interface (MPI) one-sided active target synchronization based implementation was used as an exemplar to illustrate the proposed strategy. A latency-sensitive nearest neighbor microbenchmark was used to explore the various performance aspects of the implementation. The offloaded implementation shows significant on-node performance advantages over standard MPI active RMA (36%) and point-to-point (61%)…
Taxonomy
Topics: Parallel Computing and Optimization Techniques · Interconnection Networks and Systems · Distributed and Parallel Computing Systems
