A Doubly-pipelined, Dual-root Reduction-to-all Algorithm and Implementation
Jesper Larsson Träff

TL;DR
This paper introduces a doubly-pipelined, dual-root, binary tree-based algorithm for MPI_Allreduce. It exploits bidirectional communication and, with a well-chosen pipeline block size, outperforms both reduce-followed-by-broadcast and the native MPI_Allreduce.
Contribution
The paper presents a new binary tree-based, doubly-pipelined allreduce algorithm with dual roots, enhancing efficiency by leveraging bidirectional communication capabilities.
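The dual-root structure can be illustrated with a small sequential simulation. This is only a sketch under stated assumptions, not the paper's implementation: it is not pipelined, it splits the ranks into two halves laid out as binary trees in heap order (a layout chosen here for brevity), and it assumes the reduction operator is associative and commutative. The function name `dual_root_allreduce` is hypothetical.

```python
from operator import add


def dual_root_allreduce(values, op=add):
    """Simulate a non-pipelined, dual-root allreduce over len(values) ranks.

    Assumptions (not from the paper): ranks are split into two halves, each
    half is a binary tree in heap order (children of index i are 2i+1 and
    2i+2), and `op` is associative and commutative.
    """
    p = len(values)
    ranks = list(range(p))
    half = (p + 1) // 2
    # Two trees; the second one is empty (and dropped) when p == 1.
    halves = [h for h in (ranks[:half], ranks[half:]) if h]

    # Phase 1: reduce towards each root, visiting children before parents.
    partial = list(values)
    for tree in halves:
        for i in range(len(tree) - 1, 0, -1):
            parent = tree[(i - 1) // 2]
            partial[parent] = op(partial[parent], partial[tree[i]])

    # Phase 2: the two roots exchange and combine their partial results.
    roots = [tree[0] for tree in halves]
    total = partial[roots[0]]
    if len(roots) == 2:
        total = op(total, partial[roots[1]])

    # Phase 3: each root passes the full result down its own tree.
    result = [None] * p
    for tree in halves:
        result[tree[0]] = total
        for i in range(1, len(tree)):
            result[tree[i]] = result[tree[(i - 1) // 2]]
    return result


if __name__ == "__main__":
    print(dual_root_allreduce([1, 2, 3, 4, 5]))
```

In the pipelined algorithm the three phases overlap block by block; the simulation only shows which processors combine data with which, and that every rank ends up with the full reduction.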
Findings
Achieves lower latency when the pipeline block size is chosen optimally.
Outperforms traditional reduce-broadcast and native MPI_Allreduce.
Effective on small, modern processor clusters.
Abstract
We discuss a simple, binary tree-based algorithm for the collective allreduce (reduction-to-all, MPI_Allreduce) operation for parallel systems consisting of p suitably interconnected processors. The algorithm can be doubly pipelined to exploit bidirectional (telephone-like) communication capabilities of the communication system. In order to make the algorithm more symmetric, the processors are organized into two rooted trees with communication between the two roots. For each pipeline block, each non-leaf processor takes three communication steps, consisting of receiving and sending from and to the two children, and sending and receiving to and from the root. In a round-based, uniform, linear-cost communication model in which simultaneously sending and receiving n data elements takes time α + βn for system dependent constants α (communication start-up latency) and…
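The trade-off behind the pipeline block size can be sketched with a generic pipelining estimate in the linear-cost model. This is an illustration under assumptions, not the formula from the paper (which is cut off above): every round is charged α + β·(block size), each additional block costs three rounds (following the three communication steps per block mentioned in the abstract), and `first_block_rounds` is a hypothetical parameter standing for the number of rounds until the first block has reached every processor.

```python
def pipelined_time(m, blocks, alpha, beta, first_block_rounds, rounds_per_block=3):
    """Estimated time to push m elements through the pipeline in `blocks` blocks.

    Generic model (an assumption, not the paper's formula): the first block
    needs `first_block_rounds` rounds, every further block adds
    `rounds_per_block` rounds, and each round costs alpha + beta * block_size.
    """
    block_size = m / blocks
    rounds = first_block_rounds + rounds_per_block * (blocks - 1)
    return rounds * (alpha + beta * block_size)


def best_block_count(m, alpha, beta, first_block_rounds, rounds_per_block=3):
    """Brute-force search for the block count that minimizes pipelined_time."""
    return min(
        range(1, m + 1),
        key=lambda b: pipelined_time(
            m, b, alpha, beta, first_block_rounds, rounds_per_block
        ),
    )


if __name__ == "__main__":
    # Illustrative constants only; real values of alpha and beta are system dependent.
    m, alpha, beta, depth = 1000, 10.0, 0.5, 30
    b = best_block_count(m, alpha, beta, depth)
    print("blocks:", b, "block size:", m / b)
    print("pipelined:", pipelined_time(m, b, alpha, beta, depth))
    print("unpipelined:", pipelined_time(m, 1, alpha, beta, depth))
```

Few blocks leave the pipeline underused, while many blocks pay the start-up latency α too often; the search picks the balance point between the two for the given constants.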
Taxonomy
Topics: Interconnection Networks and Systems · Parallel Computing and Optimization Techniques · Advanced Data Storage Technologies
