Self-healing Dilemmas in Distributed Systems: Fault Correction vs. Fault Tolerance

Jovan Nikolic; Nursultan Jubatyrov; Evangelos Pournaras

arXiv:2007.05261·cs.DC·June 25, 2021


TL;DR

This paper models and measures fault scenarios in large-scale decentralized systems to improve self-healing by balancing fault correction and fault tolerance, validated through extensive experiments and real-world data.

Contribution

It introduces a novel modeling approach for fault scenarios and an experimental methodology to evaluate self-healing strategies in decentralized networks.

Findings

01. Fault origin can be identified at the design phase.

02. Model predictions accurately match real-world fault behaviors.

03. Experimental results guide optimal fault detection thresholds.

Abstract

Large-scale decentralized systems of autonomous agents interacting via asynchronous communication often experience the following self-healing dilemma: fault detection inherits network uncertainties, which make a remote faulty process indistinguishable from a slow one. In the case of a slow process without a fault, fault correction is undesirable because it can trigger new faults that could be prevented with fault tolerance, a more proactive form of system maintenance. But in the case of an actual faulty process, fault tolerance alone, without eventually correcting persistent faults, can cause systems to underperform. Measuring, understanding and resolving such self-healing dilemmas is a timely challenge and a critical requirement given the rise of distributed ledgers, edge computing and the Internet of Things in several energy, transport and health applications. This paper contributes a novel and…
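The dilemma in the abstract can be illustrated with a minimal sketch of timeout-based fault detection. This is not the paper's model: the names `Observation`, `suspect` and `evaluate`, the delays, and the timeout values are all hypothetical, chosen only to show why a slow process and a faulty process look the same to a remote observer until the timeout is long enough.

```python
import math
from dataclasses import dataclass


@dataclass
class Observation:
    """Hypothetical record of one remote process as seen by an observer."""
    name: str
    delay: float  # seconds until a reply arrives; math.inf = never replies (actually faulty)


def suspect(obs: Observation, timeout: float) -> bool:
    """Timeout-based detection: suspect a process whose reply is later than `timeout`."""
    return obs.delay > timeout


def evaluate(observations, timeout: float) -> dict:
    """Split suspected processes into wrongly suspected (slow) and rightly suspected (faulty)."""
    suspected = [o for o in observations if suspect(o, timeout)]
    return {
        # Slow but healthy: correcting these is unnecessary and may trigger new faults.
        "false_positives": [o.name for o in suspected if math.isfinite(o.delay)],
        # Actually faulty: these need correction sooner or later.
        "true_positives": [o.name for o in suspected if not math.isfinite(o.delay)],
        # A faulty process is tolerated for at least this long before it is suspected.
        "detection_latency": timeout,
    }


# Assumed example delays, not measurements from the paper.
observations = [
    Observation("fast", 0.1),
    Observation("slow", 2.5),
    Observation("crashed", math.inf),
]

short = evaluate(observations, timeout=1.0)  # eager correction
long = evaluate(observations, timeout=5.0)   # patient tolerance
print(short)
print(long)
```

With the short timeout the healthy but slow process is suspected alongside the crashed one, so correction would be applied where tolerance was the better choice. With the long timeout the false suspicion disappears, but the crashed process goes uncorrected for five times as long.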
