# Self-Stabilizing Snapshot Objects for Asynchronous Fail-Prone Network   Systems

**Authors:** Chryssis Georgiou, Oskar Lundstr\"om, Elad Michael Schiller

arXiv: 1906.06420 · 2020-03-03

## TL;DR

This paper introduces self-stabilizing algorithms for snapshot objects in asynchronous crash-prone networks, enhancing fault-tolerance with quick recovery and maintaining similar communication costs to existing algorithms.

## Contribution

It extends previous fault-tolerant snapshot algorithms by incorporating self-stabilization, enabling recovery from arbitrary transient faults while preserving efficiency.

## Key findings

- Algorithms recover in O(1) time from transient faults.
- Performance evaluation shows reduced latency and communication costs.
- Validation confirms correctness and robustness of the proposed algorithms.

## Abstract

A snapshot object simulates the behavior of an array of single-writer/multi-reader shared registers that can be read atomically. Delporte-Gallet et al. proposed two fault-tolerant algorithms for snapshot objects in asynchronous crash-prone message-passing systems. Their first algorithm is \emph{non-blocking}; it allows snapshot operations to terminate once all write operations have ceased. It uses $O(n)$ messages of $O(n \nu)$ bits, where $n$ is the number of nodes and $\nu$ is the number of bits it takes to represent the object. Their second algorithm allows snapshot operations to always terminate independently of write operations. It incurs $O(n^2)$ messages.   The fault model of Delporte-Gallet et al. considers node crashes. We aim at the design of even more robust snapshot objects via the lenses of self-stabilization---a very strong notion of fault-tolerance. In addition to Delporte-Gallet et al.'s fault model, our self-stabilizing algorithm can recover after the occurrence of transient faults; these faults represent arbitrary violations of the assumptions according to which the system was designed to operate.   We propose self-stabilizing variations of Delporte-Gallet et al.'s non-blocking algorithm and always-terminating algorithm. Our algorithms have similar communication costs to the ones by Delporte-Gallet et al. and $O(1)$ recovery time from transient faults. The main differences are that our proposal considers repeated gossiping of $O(\nu)$ bit messages and deals with bounded space. We also consider an input parameter, $\delta$, for which we claim an ability to balance the costs of snapshot operations. We validate our correctness proof, evaluate the performance of Delporte-Gallet et al.'s algorithms and our proposed variations and investigate the properties of $\delta$ via PlanetLab experiments, where significant latency and communication costs reduction are observed.

---
Source: https://tomesphere.com/paper/1906.06420