HEAL: Online Incremental Recovery for Leaderless Distributed Systems Across Persistency Models

Antonis Psistakis, Burak Ocalan, Fabien Chaix, Ramnatthan Alagappan, and Josep Torrellas

arXiv:2602.08257 · cs.DC · February 10, 2026

Open Access

TL;DR

HEAL is a low-overhead, online incremental recovery scheme for non-transactional leaderless distributed systems. After a node failure it recovers the cluster in 120 ms on average while reducing the throughput of the running workload by 8.7%, and it covers Linearizable consistency under different memory persistency models.

Contribution

This paper introduces HEAL, a general recovery scheme for non-transactional leaderless distributed systems that performs an optimized online incremental recovery on a node failure, and presents its algorithms for Linearizable consistency under different memory persistency models.

Findings

01. HEAL recovers the cluster in 120 ms on average, compared to 360 seconds for conventional schemes.

02. HEAL lowers the throughput of the running workload by 8.7% on average during recovery, versus 16.2% for traditional recovery.

03. HEAL achieves 20.7x lower recovery latency and 62.4% less throughput degradation than leader-based recovery schemes.
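
The first two findings give absolute figures rather than ratios. As a rough illustration, the short Python sketch below derives the implied ratios from those quoted numbers alone. The variable and function names are hypothetical, and the derived values are plain arithmetic on the figures above, not results reported by the paper.

```python
# Figures quoted in Findings 01 and 02 (average values).
heal_recovery_s = 0.120              # 120 ms
conventional_recovery_s = 360.0      # 360 s
heal_degradation_pct = 8.7           # throughput reduction with HEAL
conventional_degradation_pct = 16.2  # throughput reduction, conventional scheme


def speedup(baseline: float, improved: float) -> float:
    """How many times smaller `improved` is than `baseline`."""
    return baseline / improved


def relative_reduction_pct(baseline: float, improved: float) -> float:
    """Percentage by which `improved` is lower than `baseline`."""
    return 100.0 * (baseline - improved) / baseline


recovery_speedup = speedup(conventional_recovery_s, heal_recovery_s)
degradation_cut = relative_reduction_pct(
    conventional_degradation_pct, heal_degradation_pct
)

print(f"Recovery time ratio (conventional / HEAL): {recovery_speedup:.0f}x")
print(f"Relative cut in throughput degradation: {degradation_cut:.1f}%")
```

These derived ratios compare HEAL with the conventional scheme in Findings 01 and 02. The 20.7x and 62.4% figures in Finding 03 refer to a different, leader-based baseline, so they are not expected to match.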

Abstract

Ensuring resilience in distributed systems has become an acute concern. In today's environment, it is crucial to develop light-weight mechanisms that recover a distributed system from faults quickly and with only a small impact on the live-system throughput. To address this need, this paper proposes a new low-overhead, general recovery scheme for modern non-transactional leaderless distributed systems. We call our scheme HEAL. On a node failure, HEAL performs an optimized online incremental recovery. This paper presents HEAL's algorithms for settings with Linearizable consistency and different memory persistency models. We implement HEAL on a 6-node Intel cluster. Our experiments running TAOBench workloads show that HEAL is very effective. HEAL recovers the cluster in 120 milliseconds on average, while reducing the throughput of the running workload by an average of 8.7%. In contrast, a…

Taxonomy

Topics: Distributed systems and fault tolerance · Software System Performance and Reliability · Distributed and Parallel Computing Systems