Optimal Multi-Level Interval-based Checkpointing for Exascale Stream Processing Systems
Sachini Jayasekara, Aaron Harwood, Shanika Karunasekera

TL;DR
This paper develops a theoretical framework for optimal multi-level interval-based checkpointing in stream processing systems, addressing scalability issues at Exascale by determining ideal checkpoint intervals and probabilities.
Contribution
It introduces a stochastic model for multi-level checkpointing and derives optimal parameters from failure rates and costs, supplying a theoretical basis that the technique previously lacked.
Findings
Optimal checkpoint intervals and probabilities derived mathematically
Model validated through stochastic simulation
Practical experiments confirm theoretical results
Abstract
State-of-the-art stream processing platforms make use of checkpointing to support fault tolerance, where a "checkpoint tuple" flows through the topology to all operators, indicating a checkpoint and triggering a checkpoint operation. The checkpoint will enable recovering from any kind of failure, be it as localized as a process fault or as widespread as power supply loss to an entire rack of machines. As we move towards Exascale computing, it is becoming clear that this kind of "single-level" checkpointing is too inefficient to scale. Some HPC researchers are now investigating multi-level checkpointing, where checkpoint operations at each level are tailored to specific kinds of failure to address the inefficiencies of single-level checkpointing. Multi-level checkpointing has been shown in practice to be superior, giving greater efficiency in operation over single-level checkpointing.…
Taxonomy
Topics: Distributed systems and fault tolerance · Advanced Data Storage Technologies · Cloud Computing and Resource Management
