From Reversible Computation to Checkpoint-Based Rollback Recovery for Message-Passing Concurrent Programs
Germán Vidal

TL;DR
This paper introduces a new rollback recovery method for message-passing concurrent programs using explicit checkpointing and reversible semantics to enhance fault tolerance.
Contribution
It presents a novel checkpointing and rollback recovery strategy specifically designed for message-passing concurrent systems, leveraging reversible semantics.
Findings
Effective recovery from failures demonstrated
Reduced rollback complexity in message-passing systems
Enhanced fault tolerance through reversible semantics
Abstract
The reliability of concurrent and distributed systems often depends on some well-known techniques for fault tolerance. One such technique is based on checkpointing and rollback recovery. Checkpointing requires processes to take snapshots of their current states regularly, so that a rollback recovery strategy is able to bring the system back to a previous consistent state whenever a failure occurs. In this paper, we consider a message-passing concurrent programming language and propose a novel rollback recovery strategy that is based on some explicit checkpointing operators and the use of a (partially) reversible semantics for rolling back the system.
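To make the idea of explicit checkpoints and causally consistent rollback concrete, here is a minimal Python sketch. It is an illustration only and not the semantics defined in the paper: the names `System`, `Process`, `check`, `send`, `receive`, and `rollback` are hypothetical, each process holds at most one active checkpoint, there is no operation for committing a checkpoint, and the rule that a receiver takes a forced checkpoint before consuming a message from a checkpointed sender is an assumption of this sketch.

```python
from copy import deepcopy
from itertools import count


class Process:
    def __init__(self, pid):
        self.pid = pid
        self.state = {"received": []}
        self.mailbox = []     # pending (msg_id, sender, payload, tentative) tuples
        self.snapshot = None  # state saved by the active checkpoint, if any
        self.sent = []        # (msg_id, receiver) pairs sent since the checkpoint
        self.consumed = []    # messages received since the checkpoint


class System:
    """Hypothetical toy runtime; not the language or semantics of the paper."""

    def __init__(self, *pids):
        self.procs = {pid: Process(pid) for pid in pids}
        self._ids = count()

    def check(self, pid):
        # Take a checkpoint; ignored if one is already active (sketch assumption).
        p = self.procs[pid]
        if p.snapshot is None:
            p.snapshot = deepcopy(p.state)

    def send(self, src, dst, payload):
        s = self.procs[src]
        tentative = s.snapshot is not None  # the sender may still undo this send
        msg = (next(self._ids), src, payload, tentative)
        self.procs[dst].mailbox.append(msg)
        if tentative:
            s.sent.append((msg[0], dst))

    def receive(self, pid):
        p = self.procs[pid]
        msg = p.mailbox.pop(0)
        if msg[3]:
            self.check(pid)  # forced checkpoint (sketch assumption)
        if p.snapshot is not None:
            p.consumed.append(msg)
        p.state["received"].append(msg[2])
        return msg[2]

    def rollback(self, pid):
        p = self.procs[pid]
        if p.snapshot is None:
            return
        p.state, p.snapshot = p.snapshot, None
        sent, p.sent = p.sent, []
        # Put back the messages consumed since the checkpoint, in arrival order.
        p.mailbox[:0] = p.consumed
        p.consumed = []
        for msg_id, dst in sent:
            d = self.procs[dst]
            if all(m[0] != msg_id for m in d.mailbox):
                self.rollback(dst)  # receiver already consumed it: undo that first
            d.mailbox = [m for m in d.mailbox if m[0] != msg_id]


system = System("p1", "p2")
system.check("p1")
system.send("p1", "p2", "hello")
system.receive("p2")   # forces a checkpoint on p2
system.rollback("p1")  # undoes the send, so p2 is rolled back as well
print(system.procs["p2"].state, system.procs["p2"].mailbox)
# prints: {'received': []} []
```

Rolling back `p1` also rolls back `p2` because `p2` has already consumed a message that `p1` sent after its checkpoint. Undoing a send only after its effects have been undone is what keeps the restored global state consistent.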
Taxonomy
Topics: Distributed systems and fault tolerance · Radiation Effects in Electronics · Security and Verification in Computing
