Node-Aware Improvements to Allreduce
Amanda Bienz, Luke N. Olson, and William D. Gropp

TL;DR
This paper introduces a node-aware allreduce algorithm that uses the multiple processes available on each node to reduce the maximum number of inter-node messages communicated by any single process, improving performance, particularly for small message sizes.
Contribution
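It proposes an allreduce algorithm that uses the multiple processes available per node for inter-node communication, reducing the maximum number of inter-node messages communicated by a single process and improving upon existing node-aware allreduce methods that rely on a single master process per node.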
Findings
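Reduces the maximum number of inter-node messages communicated by a single process compared to previous methods.
Improves allreduce performance, particularly for small message sizes.
Uses the multiple processes available per node for inter-node communication, instead of leaving all but one master process inactive at each inter-node step.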
Abstract
The MPI_Allreduce collective operation is a core kernel of many parallel codebases, particularly for reductions over a single value per process. The commonly used allreduce recursive-doubling algorithm obtains the lower bound message count, yielding optimality for small reduction sizes based on node-agnostic performance models. However, this algorithm yields duplicate messages between sets of nodes. Node-aware optimizations in MPICH remove duplicate messages through use of a single master process per node, yielding a large number of inactive processes at each inter-node step. In this paper, we present an algorithm that uses the multiple processes available per node to reduce the maximum number of inter-node messages communicated by a single process, improving the performance of allreduce operations, particularly for small message sizes.
