Tolerating Correlated Failures in Massively Parallel Stream Processing Engines
Li Su, Yongluan Zhou

TL;DR
This paper introduces a hybrid fault-tolerance framework for massively parallel stream processing engines that combines passive checkpointing with selective active replication to improve recovery speed and resource efficiency.
Contribution
The paper proposes the Passive and Partially Active (PPA) framework, which combines passive and active fault tolerance for Massively Parallel Stream Processing Engines (MPSPEs), and provides algorithms for planning replication strategies.
Findings
PPA reduces recovery latency for correlated failures.
Selective active replication improves resource utilization.
Experimental results confirm effectiveness on real and synthetic data.
Abstract
Fault-tolerance techniques for stream processing engines can be categorized into passive and active approaches. A typical passive approach periodically checkpoints a processing task's runtime states and can recover a failed task by restoring its runtime state using its latest checkpoint. On the other hand, an active approach usually employs backup nodes to run replicated tasks. Upon failure, the active replica can take over the processing of the failed task with minimal latency. However, both approaches have their own inadequacies in Massively Parallel Stream Processing Engines (MPSPE). The passive approach incurs a long recovery latency especially when a number of correlated nodes fail simultaneously, while the active approach requires extra replication resources. In this paper, we propose a new fault-tolerance framework, which is Passive and Partially Active (PPA). In a PPA scheme,…
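The two baseline approaches described in the abstract can be illustrated with a small toy model. This is a minimal sketch under stated assumptions, not the paper's PPA algorithm or the code of any real engine: the `CountingTask` operator, the fixed checkpoint interval, and the in-memory replayable stream are all hypothetical and chosen only to show where recovery work comes from. Passive recovery restores the latest checkpoint and replays the tuples seen since then, while an active replica has already processed every tuple and can take over without replay.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class CountingTask:
    """Hypothetical stateful operator: keeps a running sum of its input."""
    total: int = 0
    processed: int = 0

    def process(self, value: int) -> None:
        self.total += value
        self.processed += 1

    def snapshot(self) -> Tuple[int, int]:
        # A checkpoint is just a copy of the runtime state.
        return (self.total, self.processed)

    @classmethod
    def restore(cls, snap: Tuple[int, int]) -> "CountingTask":
        return cls(total=snap[0], processed=snap[1])


def run_until_failure(stream: List[int], fail_at: int, checkpoint_interval: int):
    """Run a primary and an active replica until the primary fails.

    Returns the primary's latest checkpoint and the replica's live state.
    """
    primary = CountingTask()
    replica = CountingTask()  # active approach: replica sees every tuple
    checkpoint = primary.snapshot()
    for i, value in enumerate(stream[:fail_at]):
        primary.process(value)
        replica.process(value)
        if (i + 1) % checkpoint_interval == 0:
            checkpoint = primary.snapshot()  # passive approach: periodic checkpoint
    return checkpoint, replica


def passive_recover(stream: List[int], fail_at: int, checkpoint: Tuple[int, int]):
    """Restore from the checkpoint and replay tuples seen after it.

    Returns the recovered task and the number of tuples replayed.
    """
    task = CountingTask.restore(checkpoint)
    replayed = 0
    for value in stream[task.processed:fail_at]:
        task.process(value)
        replayed += 1
    return task, replayed


if __name__ == "__main__":
    stream = list(range(1, 11))
    checkpoint, replica = run_until_failure(stream, fail_at=8, checkpoint_interval=3)
    recovered, replayed = passive_recover(stream, fail_at=8, checkpoint=checkpoint)
    print("tuples replayed by passive recovery:", replayed)
    print("states agree:", recovered.total == replica.total)
```

In this toy model the replay count stands in for passive recovery latency, and the second task instance stands in for the extra resources that active replication consumes.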
Taxonomy
Topics: Cloud Computing and Resource Management · Distributed systems and fault tolerance · Advanced Database Systems and Queries
