Lightweight Asynchronous Snapshots for Distributed Dataflows
Paris Carbone, Gyula Fóra, Stephan Ewen, Seif Haridi, Kostas Tzoumas

TL;DR
This paper introduces Asynchronous Barrier Snapshotting (ABS), a lightweight snapshotting algorithm for distributed dataflows. It minimizes the space that snapshots require and avoids stalling the running computation, while still producing snapshots that can be used for failure recovery.
Contribution
The paper presents ABS, a novel asynchronous snapshotting method tailored for modern dataflow engines, reducing snapshot size and disruption compared to existing approaches.
Findings
ABS maintains linear scalability during snapshots.
The algorithm reduces snapshot size by persisting only operator states on acyclic execution topologies, rather than also persisting records in transit.
ABS performs well with frequent snapshots in distributed dataflows.
Abstract
Distributed stateful stream processing enables the deployment and execution of large scale continuous computations in the cloud, targeting both low latency and high throughput. One of the most fundamental challenges of this paradigm is providing processing guarantees under potential failures. Existing approaches rely on periodic global state snapshots that can be used for failure recovery. Those approaches suffer from two main drawbacks. First, they often stall the overall computation which impacts ingestion. Second, they eagerly persist all records in transit along with the operation states which results in larger snapshots than required. In this work we propose Asynchronous Barrier Snapshotting (ABS), a lightweight algorithm suited for modern dataflow execution engines that minimises space requirements. ABS persists only operator states on acyclic execution topologies while keeping a…
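The barrier idea behind ABS on an acyclic topology can be illustrated with a small sketch. This is a toy, single-process illustration and not the authors' implementation: the `SumOperator` class, the `BARRIER` marker, and the in-memory `snapshots` list (standing in for durable storage) are all hypothetical names chosen for this example. It assumes that sources inject a barrier into each stream, and that an operator holds back an input once that input has delivered the barrier, snapshots its own state when every input has delivered it, and then forwards the barrier downstream.

```python
from collections import deque
from dataclasses import dataclass, field

BARRIER = object()  # hypothetical marker that sources inject between records


@dataclass
class SumOperator:
    """Toy stateful operator with several input channels (acyclic case only)."""
    num_inputs: int
    state: int = 0
    blocked: set = field(default_factory=set)       # inputs that delivered the barrier
    buffered: deque = field(default_factory=deque)  # items held back from blocked inputs
    snapshots: list = field(default_factory=list)   # stand-in for durable storage
    output: list = field(default_factory=list)      # stand-in for the downstream channel

    def receive(self, channel, item):
        if channel in self.blocked:
            # This input is already past the barrier: hold its items back.
            self.buffered.append((channel, item))
        elif item is BARRIER:
            self.blocked.add(channel)
            if len(self.blocked) == self.num_inputs:
                self._snapshot()
        else:
            self.state += item
            self.output.append(self.state)

    def _snapshot(self):
        # Only the operator state is persisted; no in-flight records are stored.
        self.snapshots.append(self.state)
        self.output.append(BARRIER)  # forward the barrier downstream
        self.blocked.clear()
        # Replay the held-back items in their arrival order.
        pending, self.buffered = self.buffered, deque()
        for channel, item in pending:
            self.receive(channel, item)


op = SumOperator(num_inputs=2)
op.receive(0, 1)
op.receive(0, BARRIER)  # input 0 is now held back
op.receive(0, 10)       # buffered, so it is not part of the snapshot
op.receive(1, 2)
op.receive(1, BARRIER)  # all inputs aligned: snapshot, then replay the 10
print(op.snapshots, op.state)
```

In this run the snapshot captures the state `3` (records `1` and `2`), and the record `10`, which arrived after the barrier on its channel, is applied only afterwards. Other operators keep processing while this one aligns its inputs, which is the sense in which the snapshot avoids a global stall.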
Taxonomy
Topics: Cloud Computing and Resource Management · Distributed systems and fault tolerance · Scientific Computing and Data Management
