Fault-tolerant Reduce and Allreduce operations based on correction

Martin Kuettler; Hermann Haertig

arXiv:2602.22445·cs.DC·February 27, 2026

Open Access

TL;DR

This paper introduces fault-tolerant Reduce and Allreduce algorithms in which a correction-like communication phase precedes a tree-based phase, so that the operation tolerates a number of failed processes.

Contribution

It presents a fault-tolerant Reduce algorithm built from a correction-like phase followed by a tree-based phase, states and proves its semantics, and combines it with Broadcast to obtain Allreduce.

Findings

01

The algorithm tolerates multiple process failures.

02

Semantics of the fault-tolerant Reduce are formally proven.

03

Combining the fault-tolerant Reduce with Broadcast yields an Allreduce.

Abstract

Implementations of Broadcast based on some information dissemination algorithm -- e.g., gossip or tree-based communication -- followed by a correction algorithm have been proposed previously. This work describes an approach to apply a similar idea to Reduce. In it, a correction-like communication phase precedes a tree-based phase. This provides a Reduce algorithm which is tolerant to a number of failed processes. Semantics of the resulting algorithm are provided and proven. Based on these results, Broadcast and Reduce are combined to provide Allreduce.
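
To make the two-phase structure concrete, here is a toy, single-process simulation in Python. It is not the paper's algorithm, whose details and proven semantics are not given on this page. It only illustrates the general idea of replicating partial results before a tree-based phase. Everything in it is an assumption made for illustration: the function names `toy_ft_reduce` and `toy_ft_allreduce`, the grouping of ranks into blocks of `f + 1`, the fail-stop model in which failed ranks crash before the operation starts, and the choice to exclude failed ranks' inputs from the result.

```python
from functools import reduce as fold
import operator


def toy_ft_reduce(values, failed, f, op=operator.add):
    """Toy model of a replicate-then-tree Reduce (illustrative, not the paper's).

    values[i] is the input of rank i; `failed` is the set of ranks that
    crashed before the operation started; `f` is the number of failures
    the sketch is built to tolerate.
    """
    if len(failed) > f:
        raise ValueError("more failures than this sketch is built to tolerate")
    n = len(values)

    # Assumed grouping: blocks of f + 1 consecutive ranks, so that at most
    # f failures can never wipe out a full block.
    groups = [list(range(s, min(s + f + 1, n))) for s in range(0, n, f + 1)]
    if len(groups) > 1 and len(groups[-1]) < f + 1:
        tail = groups.pop()          # fold a short last block into its neighbor
        groups[-1].extend(tail)

    # Phase 1 (correction-like): live members of a block exchange inputs, so
    # every live member ends up holding the same block partial.
    partials = []
    for group in groups:
        live = [values[r] for r in group if r not in failed]
        if not live:
            raise RuntimeError("no live rank left in a block")
        partials.append(fold(op, live))

    # Phase 2 (tree-based): pairwise combination of block partials. Any live
    # member of a block can act for it, so a failed rank does not cut the tree.
    while len(partials) > 1:
        merged = [op(partials[i], partials[i + 1])
                  for i in range(0, len(partials) - 1, 2)]
        if len(partials) % 2:
            merged.append(partials[-1])
        partials = merged
    return partials[0]


def toy_ft_allreduce(values, failed, f, op=operator.add):
    """Reduce followed by a (here trivially simulated) Broadcast to live ranks."""
    result = toy_ft_reduce(values, failed, f, op)
    return {r: result for r in range(len(values)) if r not in failed}
```

With inputs 1 through 8 on ranks 0 through 7, `f = 2`, and ranks 2 and 5 failed, the sketch combines the six surviving inputs (1, 2, 4, 5, 7, 8) and every live rank receives 27. Which inputs a real fault-tolerant Reduce must include when processes fail during the operation is exactly what the paper's semantics pin down; the toy model sidesteps that by assuming all failures happen up front.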

Taxonomy

Topics: Distributed systems and fault tolerance · Peer-to-Peer Network Technologies · Software System Performance and Reliability