A Fault-Tolerant Version of Safra's Termination Detection Algorithm

Wan Fokkink; Georgios Karlos; Andy Tatman

arXiv:2602.00272·cs.DC·February 3, 2026

Open Access

TL;DR

This paper adapts Safra's termination detection algorithm to tolerate node crashes without additional message overhead, so that termination can still be detected reliably when nodes in the distributed network fail.

Contribution

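The paper introduces a fault-tolerant version of Safra's algorithm that copes with crashes in a decentralized fashion: the token ring is restored locally, a backup token is sent, and nodes inform each other of detected crashes via the token, all without additional message overhead.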

Findings

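01  The algorithm tolerates any number of crashes, including simultaneous crashes.

02  Correctness proofs are given for both the original Safra's algorithm and its fault-tolerant variant, complemented by a model checking analysis.

03  No additional message overhead is introduced.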

Abstract

Safra's distributed termination detection algorithm employs a logical token ring structure within a distributed network; only passive nodes forward the token, and a counter in the token keeps track of the number of sent minus the number of received messages. We adapt this classic algorithm to make it fault-tolerant. The counter is split into counters per node, to discard counts from crashed nodes. If a node crashes, the token ring is restored locally and a backup token is sent. Nodes inform each other of detected crashes via the token. Our algorithm imposes no additional message overhead, tolerates any number of crashes as well as simultaneous crashes, and copes with crashes in a decentralized fashion. We provide correctness proofs for both the original Safra's algorithm and its fault-tolerant variant, together with a model checking analysis.
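
To make the token-and-counter idea in the abstract concrete, the following is a minimal sketch of the original (non-fault-tolerant) Safra's algorithm, run as a synchronous simulation. It is not the authors' code: the names `Node` and `token_round` are hypothetical, the white/black coloring follows the commonly described form of Safra's algorithm rather than anything stated on this page, and the crash handling, per-node counters, and backup token of the fault-tolerant variant are deliberately left out.

```python
from dataclasses import dataclass

WHITE, BLACK = "white", "black"


@dataclass
class Node:
    count: int = 0       # basic messages sent minus basic messages received
    color: str = WHITE   # turns black when a basic message is received
    active: bool = False

    def send_basic(self) -> None:
        self.count += 1

    def receive_basic(self) -> None:
        self.count -= 1
        self.color = BLACK
        self.active = True


def token_round(nodes: list[Node]) -> bool:
    """Simulate one token round; nodes[0] is the initiator.

    Returns True if termination is detected. As a simplification, the
    round is skipped (returns False) while any node is active, because
    only passive nodes forward the token.
    """
    if any(n.active for n in nodes):
        return False

    token_sum, token_color = 0, WHITE
    for node in nodes[1:]:
        token_sum += node.count      # accumulate sent-minus-received
        if node.color == BLACK:
            token_color = BLACK      # a black node taints the token
        node.color = WHITE           # nodes whiten when forwarding

    initiator = nodes[0]
    detected = (
        token_color == WHITE
        and initiator.color == WHITE
        and token_sum + initiator.count == 0
    )
    initiator.color = WHITE          # reset before any next round
    return detected


if __name__ == "__main__":
    ring = [Node() for _ in range(3)]
    ring[0].send_basic()             # message to node 1, still in transit
    print(token_round(ring))         # counts do not sum to zero yet
    ring[1].receive_basic()
    ring[1].active = False           # node 1 finishes its work
    print(token_round(ring))         # node 1 is black, so no detection
    print(token_round(ring))         # clean round
```

In this sketch a round succeeds only when the counts sum to zero (no basic message is in transit) and no node received a message since the token last visited it. The fault-tolerant variant described in the abstract additionally splits the counter per node so that counts from crashed nodes can be discarded.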

Taxonomy

Topics: Distributed systems and fault tolerance · Software System Performance and Reliability · Peer-to-Peer Network Technologies