Self-healing Dilemmas in Distributed Systems: Fault Correction vs. Fault Tolerance
Jovan Nikolic, Nursultan Jubatyrov, Evangelos Pournaras

TL;DR
This paper models and measures fault scenarios in large-scale decentralized systems to improve self-healing by balancing fault correction and fault tolerance, validated through extensive experiments and real-world data.
Contribution
It introduces a novel modeling approach for fault scenarios and an experimental methodology to evaluate self-healing strategies in decentralized networks.
Findings
Fault origin can be identified at the design phase.
Model predictions accurately match real-world fault behaviors.
Experimental results guide optimal fault detection thresholds.
Abstract
Large-scale decentralized systems of autonomous agents interacting via asynchronous communication often experience the following self-healing dilemma: fault detection inherits network uncertainties, making a remote faulty process indistinguishable from a slow process. When a process is slow but not faulty, fault correction is undesirable, as it can trigger new faults that could be prevented with fault tolerance, a more proactive form of system maintenance. But when a process is actually faulty, fault tolerance alone, without eventually correcting persistent faults, can leave systems underperforming. Measuring, understanding and resolving such self-healing dilemmas is a timely challenge and a critical requirement given the rise of distributed ledgers, edge computing and the Internet of Things in several energy, transport and health applications. This paper contributes a novel and…
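The dilemma in the abstract can be illustrated with a minimal timeout-based detector. This is a sketch under stated assumptions, not the paper's model or methodology: the `Probe` class, the `suspect` and `evaluate` functions, and all delay values are hypothetical and chosen only for illustration. An observer sees only how long a remote process takes to reply, so a slow process and a faulty one look the same until the timeout expires.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Probe:
    """A remote process as seen by an observer (hypothetical model)."""
    name: str
    # Reply delay in seconds; None means the process never replies (faulty).
    response_delay: Optional[float]


def suspect(probe: Probe, timeout: float) -> bool:
    """True if the observer suspects the process after waiting `timeout`."""
    return probe.response_delay is None or probe.response_delay > timeout


def evaluate(probes: List[Probe], timeout: float) -> Tuple[List[str], List[str]]:
    """Split suspected processes into false suspicions and real faults."""
    false_suspicions = [
        p.name for p in probes
        if p.response_delay is not None and suspect(p, timeout)
    ]
    real_faults = [p.name for p in probes if p.response_delay is None]
    return false_suspicions, real_faults


# Illustrative values only (assumptions, not measurements from the paper).
probes = [
    Probe("fast", 0.1),
    Probe("slow", 2.0),
    Probe("crashed", None),
]

# A short timeout reports the real fault early but also suspects the slow process.
short = evaluate(probes, timeout=1.0)
# A long timeout avoids the false suspicion but delays reporting the real fault.
long_ = evaluate(probes, timeout=3.0)
print(short, long_)
```

With the short timeout, the slow process would be corrected unnecessarily; with the long one, the crashed process stays uncorrected for longer. Choosing the threshold is the trade-off the abstract describes.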
