Persistent and Partitioned MPI for Stencil Communication
Gerald Collom, Jason Burmark, Olga Pearce, Amanda Bienz

TL;DR
This paper evaluates how persistent and partitioned MPI communication optimizations improve stencil operation performance, demonstrating up to 68% speedup in large-scale parallel applications.
Contribution
It provides a detailed analysis of the performance benefits of persistent and partitioned MPI communication routines for stencil operations at various scales.
Findings
Persistent MPI can speed up stencil communication by up to 37% over baseline MPI communication.
Partitioned MPI can speed up communication by up to 68%.
Performance improvements depend on process count, thread count, and message size.
Abstract
Many parallel applications rely on iterative stencil operations, whose performance is dominated by communication costs at large scales. Several MPI optimizations, such as persistent and partitioned communication, reduce overheads and improve communication efficiency through amortized setup costs and reduced synchronization of threaded sends. This paper presents the performance of stencil communication in the Comb benchmarking suite when using non-blocking, persistent, and partitioned communication routines. The impact of each optimization is analyzed at various scales. Further, the paper presents an analysis of the impact of process count, thread count, and message size on partitioned communication routines. Measured timings show that persistent MPI communication can provide a speedup of up to 37% over the baseline MPI communication, and partitioned MPI communication can provide a…
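The abstract attributes part of the benefit of persistent communication to amortized setup costs. A toy cost model can make that idea concrete. The sketch below is not taken from the paper: the function names, the cost parameters, and the definition of speedup as baseline time divided by optimized time minus one are all assumptions chosen for illustration, and the numbers are hypothetical rather than measured.

```python
# Toy cost model for amortized setup in iterative stencil communication.
# All parameters are hypothetical; none of these values come from the paper.


def nonblocking_time(iters: int, setup: float, transfer: float) -> float:
    """Total time when every iteration pays the per-message setup cost."""
    return iters * (setup + transfer)


def persistent_time(iters: int, setup: float, transfer: float) -> float:
    """Total time when setup is paid once and the request is reused."""
    return setup + iters * transfer


def relative_speedup(baseline: float, optimized: float) -> float:
    """Speedup as a fraction, assuming the definition baseline/optimized - 1."""
    return baseline / optimized - 1.0


if __name__ == "__main__":
    # Hypothetical costs in arbitrary time units.
    iters, setup, transfer = 1000, 1.0, 4.0
    base = nonblocking_time(iters, setup, transfer)
    pers = persistent_time(iters, setup, transfer)
    print(f"baseline={base}, persistent={pers}, "
          f"speedup={relative_speedup(base, pers):.1%}")
```

In this model the gain grows as the setup cost becomes large relative to the transfer cost, which is consistent with the finding that improvements depend on message size. It ignores synchronization and threading effects, so it says nothing about partitioned communication.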
