Analyzing Persistent Alltoallv RMA Implementations for High-Performance MPI Communication
Evelyn Namugwanya

TL;DR
This paper evaluates persistent MPI RMA Alltoallv implementations, showing runtime reductions of up to 44% for large messages and analyzing the trade-offs between fence and lock synchronization on HPC systems.
Contribution
It introduces and benchmarks persistent RMA Alltoallv variants, showing their effectiveness for large message sizes and comparing fence and lock synchronization strategies.
Findings
Persistent RMA Alltoallv reduces runtime by up to 44% for large messages.
Performance benefits are significant for messages of 32,768 bytes (32 KiB) or larger.
Fence-based synchronization generally outperforms lock-based methods for large messages.
Abstract
Collective communication operations such as MPI_Alltoallv are central to many HPC applications, particularly those with irregular message sizes. We design, implement, and evaluate persistent MPI RMA variants of Alltoallv based on fence and lock synchronization, separating a one-time initialization phase from per-iteration execution to enable reuse of communication metadata and window state across repeated epochs. Our benchmarks, run on LLNL's Dane supercomputer, show that the fence-persistent variant consistently outperforms the non-persistent baseline for large message sizes, achieving up to a 44% reduction in runtime and improving scalability with increasing process counts; at 448 processes the runtime decreases from 2.49 s to 1.54 s (a 38% reduction). We further evaluate the algorithms under irregular sparse communication patterns and compare fence- and lock-based designs, including…
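The split between one-time initialization and per-iteration execution described in the abstract can be illustrated with a small single-process simulation. This is only a sketch under stated assumptions, not the paper's implementation: the class `PersistentAlltoallvSim`, the helper `make_send_bufs`, and the 3-rank count matrix are hypothetical, plain Python lists stand in for each rank's RMA window, and a slice copy stands in for `MPI_Put`. In a real MPI program the windows would presumably be created once with something like `MPI_Win_create`, and each call to `execute` would be bracketed by `MPI_Win_fence` (or by lock/unlock calls in the lock-based variant).

```python
from itertools import accumulate


def exclusive_prefix_sum(counts):
    """Offsets of each block when blocks of the given sizes are packed back to back."""
    return [0] + list(accumulate(counts))[:-1]


class PersistentAlltoallvSim:
    """Hypothetical single-process stand-in for a persistent RMA Alltoallv."""

    def __init__(self, send_counts):
        # Initialization phase, done once.
        # send_counts[src][dst] = number of elements rank src sends to rank dst.
        n = len(send_counts)
        self.n = n
        self.send_counts = send_counts
        # Where each destination's block starts inside the sender's buffer.
        self.send_displs = [exclusive_prefix_sum(row) for row in send_counts]
        # What each rank receives is the transpose of what is sent.
        recv_counts = [[send_counts[src][dst] for src in range(n)]
                       for dst in range(n)]
        # Where each source's block lands inside the receiver's window.
        self.recv_displs = [exclusive_prefix_sum(row) for row in recv_counts]
        # Simulated windows: allocated once, reused by every execute() call.
        self.windows = [[None] * sum(row) for row in recv_counts]

    def execute(self, send_bufs):
        # Per-iteration phase: only data movement, no metadata recomputation.
        # (An opening fence would go here in a fence-based MPI version.)
        for src in range(self.n):
            for dst in range(self.n):
                count = self.send_counts[src][dst]
                s0 = self.send_displs[src][dst]
                r0 = self.recv_displs[dst][src]
                # Slice copy stands in for a one-sided put into dst's window.
                self.windows[dst][r0:r0 + count] = send_bufs[src][s0:s0 + count]
        # (A closing fence would go here.)
        return self.windows


def make_send_bufs(send_counts, tag):
    """Build labeled payloads such as 'a:0->1' so routing is easy to inspect."""
    n = len(send_counts)
    return [[f"{tag}:{src}->{dst}"
             for dst in range(n)
             for _ in range(send_counts[src][dst])]
            for src in range(n)]


# Irregular example pattern on 3 simulated ranks (values chosen for illustration).
counts = [[1, 2, 0],
          [0, 1, 3],
          [2, 0, 1]]

op = PersistentAlltoallvSim(counts)        # initialize once
for tag in ("a", "b"):                     # execute repeatedly
    windows = op.execute(make_send_bufs(counts, tag))
    print(tag, windows)
```

The point of the sketch is that the displacement tables and the window storage are built in `__init__` and untouched afterwards, so the cost of setting them up is paid once rather than on every exchange. How much this saves on a real system depends on the MPI implementation and message sizes, which is what the paper's benchmarks measure.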
