A Bandwidth-saving Optimization for MPI Broadcast Collective Operation
Huan Zhou, Vladimir Marjanovic, Christoph Niethammer, José Gracia

TL;DR
This paper proposes a bandwidth-saving broadcast method for MPI in high-performance computing, improving performance for large messages and for medium-sized messages with non-power-of-two process counts.
Contribution
It introduces a tuned broadcast approach that builds on the existing MPICH implementation, specifically targeting large messages and medium messages with non-power-of-two process counts.
Findings
Performance improved by up to 54% for large messages.
Bandwidth savings achieved for medium messages with non-power-of-two process counts.
Validated on a Cray XC40 cluster with various data sizes.
Abstract
The efficiency and scalability of MPI collective operations, in particular the broadcast operation, play an integral part in high-performance computing applications. MPICH, one of the widely used contemporary MPI software stacks, implements the broadcast operation on top of point-to-point operations. Depending on parameters such as message size and process count, the library chooses among different algorithms, for instance binomial dissemination, recursive-doubling exchange, or ring all-to-all broadcast (allgather). However, the existing broadcast design in the latest release of MPICH does not provide good performance for large messages (lmsg) or for medium messages with non-power-of-two process counts (mmsg-npof2), because the ring allgather algorithm it uses internally is suboptimal. In this paper, based on the native broadcast design in MPICH, we propose a tuned broadcast approach…
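To make the algorithm selection described in the abstract concrete, here is a minimal Python sketch of a size- and process-count-based dispatch, together with a small simulation of a ring allgather. It is an illustration under stated assumptions, not MPICH code and not the paper's tuned method: the function names, the returned labels, and the threshold defaults (`short_limit`, `long_limit`, `min_procs`) are hypothetical placeholders chosen for the example, and the mapping of short messages to binomial dissemination and of medium power-of-two cases to recursive doubling is an assumption consistent with, but not spelled out in, the abstract.

```python
def is_power_of_two(n: int) -> bool:
    """Return True if n is a positive power of two."""
    return n > 0 and (n & (n - 1)) == 0


def choose_bcast_algorithm(msg_bytes: int, nprocs: int,
                           short_limit: int = 12288,
                           long_limit: int = 524288,
                           min_procs: int = 8) -> str:
    """Pick a broadcast algorithm from message size and process count.

    The thresholds are illustrative assumptions, not values from the paper.
    """
    if msg_bytes < short_limit or nprocs < min_procs:
        return "binomial"
    if msg_bytes < long_limit and is_power_of_two(nprocs):
        return "scatter + recursive-doubling allgather"
    # Large messages (lmsg) and medium messages with a
    # non-power-of-two process count (mmsg-npof2) end up here.
    return "scatter + ring allgather"


def ring_allgather(chunks: list) -> list:
    """Simulate a ring allgather among len(chunks) ranks.

    Rank r starts with chunks[r] only. In each of the P-1 steps every
    rank forwards one chunk to its right neighbour (r + 1) % P.
    Returns the full buffer held by each rank afterwards.
    """
    p = len(chunks)
    have = [{r: chunks[r]} for r in range(p)]
    for step in range(p - 1):
        # Collect all sends first so the step behaves like a
        # simultaneous exchange around the ring.
        sends = []
        for r in range(p):
            idx = (r - step) % p  # chunk that rank r forwards this step
            sends.append(((r + 1) % p, idx, have[r][idx]))
        for dst, idx, data in sends:
            have[dst][idx] = data
    return [[h[i] for i in range(p)] for h in have]


if __name__ == "__main__":
    for size, procs in [(1024, 16), (65536, 16), (65536, 12), (1 << 20, 16)]:
        print(size, procs, choose_bcast_algorithm(size, procs))
    print(ring_allgather(["a", "b", "c"]))
```

In this sketch, the two cases the abstract singles out (lmsg and mmsg-npof2) both fall through to the ring allgather branch, and the simulation shows that every rank takes part in all P-1 forwarding steps of the ring.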
