Reliability and Fault-Tolerance by Choreographic Design
Ian Cassar (Reykjavik University), Adrian Francalanza (University of Malta), Claudio Antares Mezzina (IMT School for Advanced Studies Lucca, Italy), Emilio Tuosto (University of Leicester, UK)

TL;DR
This paper introduces a formal approach for designing reliable and fault-tolerant distributed programs by specifying recovery strategies abstractly, enabling automated synthesis of monitoring and adaptation mechanisms.
Contribution
It proposes formal abstractions for failure recovery and adaptation in message-passing programs, facilitating automatic code generation and debugging.
Findings
Defines formal behavioral models of failures
Specifies properties of adaptation strategies
Shows potential for automated synthesis of monitoring and recovery code
Abstract
Distributed programs are hard to get right because they are required to be open, scalable, long-running, and tolerant to faults. In particular, recent approaches to distributed software based on (micro-)services, where different services are developed independently by disparate teams, exacerbate the problem. In fact, services are meant to be composed together and run in open contexts where unpredictable behaviours can emerge. This makes it necessary to adopt suitable strategies for monitoring the execution and to incorporate recovery and adaptation mechanisms so as to make distributed programs more flexible and robust. The typical approach currently adopted is to embed such mechanisms in the program logic, which makes them hard to extract, compare and debug. We propose an approach that employs formal abstractions for specifying failure recovery and adaptation strategies. Although…
