Falkirk Wheel: Rollback Recovery for Dataflow Systems
Michael Isard, Mart\'in Abadi

TL;DR
This paper introduces a flexible rollback recovery model for distributed dataflow systems using multiple logical time domains, enabling efficient checkpointing and selective rollback for complex applications.
Contribution
It proposes a novel logical time-based rollback scheme with multiple domains and selective rollback, improving recovery flexibility in dataflow systems.
Findings
Supports combining batch and streaming processing
Enables application-specific checkpointing policies
Demonstrates implementation in Naiad system
Abstract
We present a new model for rollback recovery in distributed dataflow systems. We explain existing rollback schemes by assigning a logical time to each event such as a message delivery. If some processors fail during an execution, the system rolls back by selecting a set of logical times for each processor. The effect of events at times within the set is retained or restored from saved state, while the effect of other events is undone and re-executed. We show that, by adopting different logical time "domains" at different processors, an application can adopt appropriate checkpointing schemes for different parts of its computation. We illustrate with an example of an application that combines batch processing with low-latency streaming updates. We show rules, and an algorithm, to determine a globally consistent state for rollback in a system that uses multiple logical time domains. We…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsDistributed systems and fault tolerance · Cloud Computing and Resource Management · Distributed and Parallel Computing Systems
