Fault-tolerant Reduce and Allreduce operations based on correction
Martin Kuettler, Hermann Haertig

TL;DR
This paper introduces fault-tolerant Reduce and Allreduce algorithms in which a correction-like phase precedes a tree-based communication phase, making the operations tolerant to a number of failed processes.
Contribution
It presents a fault-tolerant Reduce algorithm with proven semantics, built from a correction-like communication phase followed by a tree-based phase, and combines it with Broadcast to obtain Allreduce.
Findings
The algorithm tolerates multiple process failures.
Semantics of the fault-tolerant Reduce are formally proven.
Combining Broadcast with the fault-tolerant Reduce provides Allreduce.
Abstract
Implementations of Broadcast based on some information dissemination algorithm -- e.g., gossip or tree-based communication -- followed by a correction algorithm have been proposed previously. This work describes an approach that applies a similar idea to Reduce: a correction-like communication phase precedes a tree-based phase. The result is a Reduce algorithm that tolerates a number of failed processes. Semantics of the resulting algorithm are provided and proven. Based on these results, Broadcast and Reduce are combined to provide Allreduce.
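The composition mentioned at the end of the abstract (Allreduce as Reduce followed by Broadcast) can be illustrated with a small single-process simulation. This is only a sketch under stated assumptions, not the authors' algorithm: it models no failures and omits the correction phase entirely, since this summary does not describe those details. The binary-tree layout (parent of rank r is (r - 1) // 2) and the function names tree_reduce, tree_broadcast, and allreduce are hypothetical choices made for illustration, and the operator is assumed to be associative and commutative.

```python
import operator


def tree_reduce(values, op):
    """Fold one value per rank up a binary tree rooted at rank 0.

    Assumption: the parent of rank r is (r - 1) // 2 and op is
    associative and commutative. Failures are not modeled.
    """
    partial = list(values)  # partial[r] is the current partial result at rank r
    # Visit ranks from highest to lowest, so each rank has already absorbed
    # its children's contributions before it "sends" to its parent.
    for rank in range(len(partial) - 1, 0, -1):
        parent = (rank - 1) // 2
        partial[parent] = op(partial[parent], partial[rank])
    return partial[0]


def tree_broadcast(value, n):
    """Push the root's value down the same tree to all n ranks."""
    received = [None] * n
    received[0] = value
    # A parent always has a lower rank than its children, so ascending
    # order guarantees the parent holds the value before the child reads it.
    for rank in range(1, n):
        received[rank] = received[(rank - 1) // 2]
    return received


def allreduce(values, op):
    """Allreduce as Reduce to rank 0 followed by Broadcast from rank 0."""
    return tree_broadcast(tree_reduce(values, op), len(values))


if __name__ == "__main__":
    print(allreduce([1, 2, 3, 4, 5], operator.add))
```

In this failure-free sketch every rank ends up holding the same reduced value. A fault-tolerant version along the lines of the abstract would additionally need the correction-like phase before the tree phase, which is not reproduced here.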
Taxonomy
Topics: Distributed systems and fault tolerance · Peer-to-Peer Network Technologies · Software System Performance and Reliability
