HEAL: Online Incremental Recovery for Leaderless Distributed Systems Across Persistency Models
Antonis Psistakis, Burak Ocalan, Fabien Chaix, Ramnatthan Alagappan, and Josep Torrellas

TL;DR
HEAL is a low-overhead, online incremental recovery scheme for leaderless distributed systems. It works across different memory persistency models, recovers a cluster in 120 ms on average, and costs the running workload 8.7% of its throughput, well below the cost of conventional recovery.
Contribution
This paper introduces HEAL, a novel recovery scheme for leaderless distributed systems that enables fast, online fault recovery with minimal performance impact, adaptable to different persistency models.
Findings
HEAL recovers in 120 ms on average, compared to 360 seconds for conventional schemes.
HEAL reduces throughput degradation to 8.7%, versus 16.2% for traditional recovery.
Compared with leader-based recovery schemes, HEAL reduces recovery latency by 20.7x and throughput loss by 62.4%.
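The first two findings compare HEAL with a conventional scheme but do not state the resulting ratios. The short Python sketch below derives them from the quoted figures only. The variable and function names are illustrative, and the derived ratios are simple arithmetic on the summary numbers, not results reported by the authors. They are also distinct from the 20.7x and 62.4% figures, which refer to a leader-based baseline.

```python
# Figures quoted in the findings (HEAL vs. a conventional recovery scheme).
HEAL_RECOVERY_MS = 120               # 120 ms average recovery time
CONVENTIONAL_RECOVERY_MS = 360_000   # 360 seconds, expressed in milliseconds
HEAL_DEGRADATION_PCT = 8.7           # throughput degradation during recovery
CONVENTIONAL_DEGRADATION_PCT = 16.2


def speedup(baseline, improved):
    """How many times smaller `improved` is than `baseline`."""
    return baseline / improved


def relative_reduction_pct(baseline, improved):
    """Percentage by which `improved` is lower than `baseline`."""
    return 100.0 * (baseline - improved) / baseline


recovery_speedup = speedup(CONVENTIONAL_RECOVERY_MS, HEAL_RECOVERY_MS)
degradation_cut = relative_reduction_pct(
    CONVENTIONAL_DEGRADATION_PCT, HEAL_DEGRADATION_PCT
)

print(f"Recovery speedup vs. conventional: {recovery_speedup:.0f}x")
print(f"Relative cut in throughput degradation: {degradation_cut:.1f}%")
```

Under these figures, recovery is about 3000 times faster than the conventional scheme, and the throughput penalty is a little under half of the conventional one.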
Abstract
Ensuring resilience in distributed systems has become an acute concern. In today's environment, it is crucial to develop light-weight mechanisms that recover a distributed system from faults quickly and with only a small impact on the live-system throughput. To address this need, this paper proposes a new low-overhead, general recovery scheme for modern non-transactional leaderless distributed systems. We call our scheme HEAL. On a node failure, HEAL performs an optimized online incremental recovery. This paper presents HEAL's algorithms for settings with Linearizable consistency and different memory persistency models. We implement HEAL on a 6-node Intel cluster. Our experiments running TAOBench workloads show that HEAL is very effective. HEAL recovers the cluster in 120 milliseconds on average, while reducing the throughput of the running workload by an average of 8.7%. In contrast, a…
Taxonomy
Topics: Distributed systems and fault tolerance · Software System Performance and Reliability · Distributed and Parallel Computing Systems
