Configurable Non-uniform All-to-all Algorithms
Ke Fan, Jens Domke, Seydou Ba, Sidharth Kumar

TL;DR
This paper presents Tunable Non-uniform All-to-all (TuNA) algorithms that account for the system's communication hierarchy and for data-size variation, and that significantly outperform existing MPI implementations.
Contribution
Introduction of flexible, hierarchical all-to-all algorithms that adapt to system architecture and data size variations, improving performance over current methods.
Findings
Achieves up to 138x speedup on the Fugaku supercomputer.
Effectively balances bandwidth and latency in communication.
Outperforms state-of-the-art MPI implementations by large margins.
Abstract
MPI_Alltoallv generalizes the uniform all-to-all communication (MPI_Alltoall) by enabling the exchange of data blocks of varied sizes among processes. This function plays a crucial role in many applications, such as FFT computation and relational algebra operations. Popular MPI libraries, such as MPICH and OpenMPI, implement MPI_Alltoall using a combination of linear and logarithmic algorithms. However, MPI_Alltoallv typically relies only on variations of linear algorithms, missing the benefits of logarithmic approaches. Furthermore, current algorithms overlook the intricacies of modern HPC system architectures, such as the significant performance gap between intra-node (local) and inter-node (global) communication. This paper introduces a set of Tunable Non-uniform All-to-all algorithms, denoted TuNA{l}{g}, where g and l refer to global (inter-node) and local (intra-node)…
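To make the abstract's terms concrete, the following is a minimal pure-Python sketch, not the authors' TuNA algorithm and not real MPI code. It simulates the semantics of a non-uniform all-to-all with a linear, P-round schedule, and counts inter-node messages for a direct exchange versus a hypothetical scheme that combines all traffic between two nodes into one message. The names `alltoallv_linear` and `inter_node_messages`, and the node-aggregation model itself, are assumptions made for illustration only.

```python
from typing import List

Block = List[int]  # a variable-length data block


def alltoallv_linear(send: List[List[Block]]) -> List[List[Block]]:
    """Simulate a non-uniform all-to-all over P ranks.

    send[i][j] is the block rank i sends to rank j (sizes may differ).
    Returns recv, where recv[j][i] is the block rank j received from rank i.
    Uses a linear schedule: in round k, rank i sends to rank (i + k) % P.
    """
    p = len(send)
    recv: List[List[Block]] = [[[] for _ in range(p)] for _ in range(p)]
    for k in range(p):  # P rounds, one partner per rank per round
        for src in range(p):
            dst = (src + k) % p
            recv[dst][src] = list(send[src][dst])
    return recv


def inter_node_messages(p: int, ranks_per_node: int, aggregate: bool) -> int:
    """Count inter-node messages for P ranks (hypothetical cost model).

    Assumes p is a multiple of ranks_per_node.
    aggregate=False: every rank sends directly to every off-node rank.
    aggregate=True: one combined message per ordered pair of distinct nodes
    (an illustrative node-level aggregation, not the paper's method).
    """
    nodes = p // ranks_per_node
    if aggregate:
        return nodes * (nodes - 1)
    return p * (p - ranks_per_node)


if __name__ == "__main__":
    # Block from rank i to rank j has (i + j) % 3 elements: non-uniform sizes.
    send = [[[i * 10 + j] * ((i + j) % 3) for j in range(4)] for i in range(4)]
    recv = alltoallv_linear(send)
    print(recv[2][1] == send[1][2])
    print(inter_node_messages(8, 4, aggregate=False),
          inter_node_messages(8, 4, aggregate=True))
```

In this toy model, aggregation trades many small inter-node messages for a few larger ones at the cost of extra intra-node data movement, which is the kind of latency-versus-bandwidth trade-off the findings refer to.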
