Reliable Actors with Retry Orchestration
Olivier Tardieu (IBM Research), David Grove (IBM Research), Gheorghe-Teodor Bercea (IBM Research), Paul Castro (IBM Research), Jaroslaw Cwiklik (IBM Research), Edward Epstein (IBM Research)

TL;DR
This paper proposes a fault-tolerant, actor-based programming model for the cloud built on retry orchestration and tail calls. It formalizes the model as a process calculus, describes an implementation, and evaluates its performance.
Contribution
It introduces a fault-tolerance model for cloud actors in which failed invocations are retried and completed invocations are never repeated, and it formalizes that model as a process calculus.
Findings
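Retry orchestration retries failed actor invocations, never repeats completed ones, and preserves a strict happens-before relationship across failures within call stacks.
A process calculus formalizes the model's fault-tolerance mechanisms.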
Implementation validates correctness and performance impact.
Abstract
Cloud developers have to build applications that are resilient to failures and interruptions. We advocate for a fault-tolerant programming model for the cloud based on actors, retry orchestration, and tail calls. This model builds upon persistent data stores and message queues readily available on the cloud. Retry orchestration guarantees not only that (1) failed actor invocations will be retried, but also that (2) completed invocations are never repeated and (3) a strict happens-before relationship is preserved across failures within call stacks. Tail calls can break complex tasks into simple steps to minimize re-execution during recovery. We review key application patterns and failure scenarios. We formalize a process calculus to precisely capture the mechanisms of fault tolerance in this model. We briefly describe our implementation. Using an application inspired by a typical…
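The first two guarantees and the role of tail calls can be illustrated with a toy, single-process simulation. This is only a sketch under stated assumptions, not the paper's implementation: an in-memory deque stands in for a cloud message queue, a dict stands in for the persistent store, and every name (run, TailCall, CrashError, flaky, the "reserve" and "charge" steps) is hypothetical. It does not model guarantee (3) or nested calls.

```python
from collections import deque


class CrashError(Exception):
    """Simulated failure of the process running an invocation."""


class TailCall:
    """Returned by a step to hand the rest of the task to another step."""

    def __init__(self, step, arg):
        self.step = step
        self.arg = arg


def run(queue, store, handlers):
    """Drain the queue, retrying failed invocations and skipping completed ones."""
    while queue:
        invocation_id, step, arg = queue[0]      # peek: the message stays queued
        if invocation_id in store["done"]:       # completed invocations never repeat
            queue.popleft()
            continue
        try:
            outcome = handlers[step](arg)
        except CrashError:
            continue                             # message is still queued, so it is retried
        # Assumption: a real system would have to make the next effects atomic.
        store["done"].add(invocation_id)
        queue.popleft()
        if isinstance(outcome, TailCall):
            next_id = invocation_id + ">" + outcome.step
            queue.append((next_id, outcome.step, outcome.arg))
        else:
            root_id = invocation_id.split(">")[0]
            store["results"][root_id] = outcome


attempts = []                                    # every attempt, including failed ones
failures = {"reserve": 1, "charge": 1}           # each step crashes once, then succeeds


def flaky(name, body):
    """Wrap a step so that it crashes while its failure budget is positive."""
    def handler(arg):
        attempts.append(name)
        if failures[name] > 0:
            failures[name] -= 1
            raise CrashError(name)
        return body(arg)
    return handler


handlers = {
    "reserve": flaky("reserve", lambda order: TailCall("charge", order + ["reserved"])),
    "charge": flaky("charge", lambda order: order + ["charged"]),
}

store = {"done": set(), "results": {}}
queue = deque([("order-1", "reserve", [])])
run(queue, store, handlers)
print(attempts)
print(store["results"])
```

In this run each step is attempted twice, once failing and once succeeding. The crash in "charge" does not cause "reserve" to run again, because the tail call has already recorded "reserve" as complete. That is the sense in which splitting a task into steps limits re-execution during recovery.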
Taxonomy
Topics: Cloud Computing and Resource Management · Software System Performance and Reliability · Cloud Data Security Solutions
