From Reversible Computation to Checkpoint-Based Rollback Recovery for Message-Passing Concurrent Programs

Germán Vidal

arXiv:2309.04873 · cs.PL · November 15, 2023

TL;DR

This paper introduces a new rollback recovery method for message-passing concurrent programs using explicit checkpointing and reversible semantics to enhance fault tolerance.

Contribution

It presents a novel checkpointing and rollback recovery strategy specifically designed for message-passing concurrent systems, leveraging reversible semantics.

Findings

01. Effective recovery from failures demonstrated

02. Reduced rollback complexity in message-passing systems

03. Enhanced fault tolerance through reversible semantics

Abstract

The reliability of concurrent and distributed systems often depends on well-known techniques for fault tolerance. One such technique is based on checkpointing and rollback recovery. Checkpointing requires processes to take snapshots of their current states regularly, so that a rollback recovery strategy can bring the system back to a previous consistent state whenever a failure occurs. In this paper, we consider a message-passing concurrent programming language and propose a novel rollback recovery strategy that is based on explicit checkpointing operators and the use of a (partially) reversible semantics for rolling back the system.
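
To make the idea of snapshot and restore concrete, here is a minimal single-process sketch in Python. It is not the paper's language or semantics: the class `Process` and the methods `checkpoint`, `receive`, and `rollback` are hypothetical names chosen for illustration, and the sketch ignores the hard part of rollback recovery in a message-passing setting, namely propagating a rollback to other processes so that the whole system returns to a consistent state.

```python
import copy


class Process:
    """Toy process with explicit checkpoints (hypothetical API, illustration only)."""

    def __init__(self, name, state):
        self.name = name
        self.state = state
        self.mailbox = []        # messages delivered but not yet consumed
        self._checkpoints = []   # stack of (state, mailbox) snapshots

    def checkpoint(self):
        """Save a snapshot of the current state and mailbox; return its id."""
        self._checkpoints.append((copy.deepcopy(self.state), list(self.mailbox)))
        return len(self._checkpoints) - 1

    def receive(self):
        """Consume the oldest message and record it in the local state."""
        msg = self.mailbox.pop(0)
        self.state["seen"].append(msg)
        return msg

    def rollback(self, cp_id):
        """Restore the snapshot cp_id and discard any later checkpoints."""
        state, mailbox = self._checkpoints[cp_id]
        self.state = copy.deepcopy(state)
        self.mailbox = list(mailbox)
        del self._checkpoints[cp_id + 1:]


# Usage: deliver a message, checkpoint, consume it, then undo the consumption.
p = Process("p", {"seen": []})
p.mailbox.append("m1")
cp = p.checkpoint()
p.receive()
state_after_receive = copy.deepcopy(p.state)
p.rollback(cp)
```

After `rollback(cp)` the process is back in the state it had when the checkpoint was taken, with `m1` waiting in the mailbox again. In a real message-passing system, undoing a send or a receive in one process generally forces related processes to roll back as well, which this sketch does not attempt.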

Taxonomy

Topics: Distributed systems and fault tolerance · Radiation Effects in Electronics · Security and Verification in Computing