TEMPI: An Interposed MPI Library with a Canonical Representation of CUDA-aware Datatypes
Carl Pearson, Kun Wu, I-Hsin Chung, Jinjun Xiong, Wen-Mei Hwu

TL;DR
This paper introduces TEMPI, a novel MPI interposer library that optimizes handling of CUDA-aware non-contiguous datatypes, significantly improving MPI communication performance on GPU-enabled systems.
Contribution
It presents a new datatype handling strategy for nested strided datatypes, and it models the performance of non-contiguous data handling in order to transparently reduce MPI communication latency.
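The idea of choosing a data-handling method from a measurement-based model can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not TEMPI's implementation: the two strategy names, the `MEASURED` table, and the helpers `interp` and `choose` are hypothetical, and the latency numbers are placeholders rather than results from the paper.

```python
from bisect import bisect_left

# Hypothetical per-strategy latency tables: (message size in bytes, seconds).
# Placeholder values for illustration only; a real system would fill these
# in from benchmarks run on the target machine.
MEASURED = {
    "pack_on_device_then_send": [(1024, 30e-6), (1048576, 200e-6)],
    "pack_to_host_then_send": [(1024, 20e-6), (1048576, 400e-6)],
}


def interp(points, x):
    """Piecewise-linear interpolation over sorted (size, seconds) points.

    Values outside the measured range are clamped to the nearest endpoint.
    """
    xs = [p[0] for p in points]
    if x <= xs[0]:
        return points[0][1]
    if x >= xs[-1]:
        return points[-1][1]
    i = bisect_left(xs, x)  # xs[i-1] < x <= xs[i]
    (x0, y0), (x1, y1) = points[i - 1], points[i]
    return y0 + (y1 - y0) * (x - x0) / (x1 - x0)


def choose(nbytes):
    """Return the strategy with the lowest predicted latency for nbytes."""
    return min(MEASURED, key=lambda name: interp(MEASURED[name], nbytes))


small_choice = choose(1024)
large_choice = choose(1 << 20)
```

With these placeholder tables, the host-staged path is predicted to be faster for the small message and the device path for the large one. The point of the sketch is only that the decision is driven by measured data and made per message, without changes to the calling application.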
Findings
MPI_Pack speedup of up to 242000x
MPI_Send speedup of up to 59000x
More than 917x speedup in a 3D halo exchange
Abstract
MPI derived datatypes are an abstraction that simplifies handling of non-contiguous data in MPI applications. These datatypes are recursively constructed at runtime from primitive Named Types defined in the MPI standard. More recently, the development and deployment of CUDA-aware MPI implementations has encouraged the transition of distributed high-performance MPI codes to use GPUs. Such implementations allow MPI functions to directly operate on GPU buffers, easing integration of GPU compute into MPI codes. This work first presents a novel datatype handling strategy for nested strided datatypes, which finds a middle ground between the specialized or generic handling in prior work. This work also shows that the performance characteristics of non-contiguous data handling can be modeled with empirical system measurements, and used to transparently improve MPI_Send/Recv latency. Finally,…
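To make the notion of a nested strided datatype concrete, the following plain-Python sketch models one as a contiguous block size plus a list of (count, stride) levels, and shows that a naively constructed description and a simplified one pack the same bytes. It is a host-side illustration only: the names `StridedLevel`, `StridedType`, `canonicalize`, `offsets`, and `pack` are hypothetical, the simplification rules are assumptions chosen for the example, and none of it is taken from TEMPI's source.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class StridedLevel:
    count: int   # number of repetitions at this nesting level
    stride: int  # bytes between the starts of consecutive repetitions


@dataclass(frozen=True)
class StridedType:
    block_bytes: int            # contiguous bytes at the innermost level
    levels: List[StridedLevel]  # nesting levels, innermost first


def canonicalize(t: StridedType) -> StridedType:
    """Simplify a description without changing which bytes it selects."""
    # A level repeated once contributes nothing, so drop it.
    levels = [lv for lv in t.levels if lv.count != 1]
    block = t.block_bytes
    # If innermost repetitions are adjacent in memory, fold them into one
    # larger contiguous block.
    while levels and levels[0].stride == block:
        block *= levels[0].count
        levels.pop(0)
    return StridedType(block, levels)


def offsets(t: StridedType) -> List[int]:
    """Byte offset of every contiguous block, outermost index slowest."""
    starts = [0]
    for lv in t.levels:
        starts = [i * lv.stride + s for i in range(lv.count) for s in starts]
    return starts


def pack(buf: bytes, t: StridedType) -> bytes:
    """Gather the selected bytes into one contiguous buffer."""
    return b"".join(buf[o:o + t.block_bytes] for o in offsets(t))


# An 8 x 4 x 2 array of 1-byte elements (x fastest); select a 2 x 3 x 2 region.
buf = bytes(range(64))
naive = StridedType(1, [StridedLevel(2, 1),    # 2 elements along x
                        StridedLevel(3, 8),    # 3 rows, 8 bytes apart
                        StridedLevel(2, 32)])  # 2 planes, 32 bytes apart
canon = canonicalize(naive)
packed = pack(buf, canon)
```

Here the innermost level (two 1-byte elements, 1 byte apart) folds into a single 2-byte block, so the simplified description has two levels instead of three while selecting exactly the same bytes. In real MPI code the same region would be described with calls such as `MPI_Type_vector` or `MPI_Type_create_subarray` and gathered with `MPI_Pack`.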
Taxonomy
Topics: Advanced Data Storage Technologies · Parallel Computing and Optimization Techniques · Distributed and Parallel Computing Systems
