A Utilization Model for Optimization of Checkpoint Intervals in Distributed Stream Processing Systems
Sachini Jayasekara, Aaron Harwood, Shanika Karunasekera

TL;DR
This paper develops a comprehensive model to optimize checkpoint intervals in distributed stream processing systems, improving fault tolerance efficiency by considering system parameters and failure characteristics.
Contribution
It introduces a rigorous utilization-based model for checkpoint interval optimization that accounts for multiple system factors, providing a theoretical basis for enhanced fault-tolerance strategies.
Findings
Model accurately predicts optimal checkpoint intervals.
Experiments show improved system utilization with the model.
Performance gains increase with system size.
Abstract
State-of-the-art distributed stream processing systems such as Apache Flink and Storm have recently included checkpointing to provide fault-tolerance for stateful applications. This is a necessary eventuality as these systems head into the Exascale regime, and is evidently more efficient than replication as state size grows. However, current systems use a nominal value for the checkpoint interval, indicative of assuming roughly 1 failure every 19 days, that does not take into account the salient aspects of the checkpoint process, nor the system scale, which can readily lead to inefficient system operation. To address this shortcoming, we provide a rigorous derivation of utilization -- the fraction of total time available for the system to do useful work -- that incorporates checkpoint interval, failure rate, checkpoint cost, failure detection and restart cost, depth of the system…
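To make the trade-off behind the utilization definition concrete, here is a minimal sketch. It is not the paper's model, which is not reproduced in this summary. It uses the classical first-order approximation usually attributed to Young, and it assumes a single checkpoint cost and exponentially distributed failures. It ignores the failure detection cost, restart cost, and system depth that the abstract lists. The function names and the example numbers are hypothetical.

```python
import math


def young_interval(checkpoint_cost_s: float, mtbf_s: float) -> float:
    """First-order optimal checkpoint interval (Young's approximation).

    Assumption: this is the classical sqrt(2 * C * MTBF) rule,
    not the utilization model derived in the paper.
    """
    return math.sqrt(2.0 * checkpoint_cost_s * mtbf_s)


def approx_utilization(interval_s: float, checkpoint_cost_s: float, mtbf_s: float) -> float:
    """Rough fraction of time spent on useful work.

    Two losses are modelled (both simplifying assumptions):
      - checkpoint overhead: one checkpoint of cost C per interval T
      - rework after a failure: on average half an interval is lost,
        and failures arrive at rate 1 / MTBF
    """
    overhead = checkpoint_cost_s / interval_s
    rework = interval_s / (2.0 * mtbf_s)
    return 1.0 - overhead - rework


if __name__ == "__main__":
    # Hypothetical numbers: 10 s checkpoint cost, one failure every 19 days.
    cost = 10.0
    mtbf = 19 * 24 * 3600.0
    best = young_interval(cost, mtbf)
    print(f"interval ~ {best:.0f} s, utilization ~ {approx_utilization(best, cost, mtbf):.4f}")
```

Under this approximation, checkpointing too often wastes time on overhead and checkpointing too rarely wastes time on rework, so utilization peaks at an intermediate interval. Because the optimum scales with the square root of the mean time between failures, a fixed nominal interval becomes less suitable as the failure rate changes with system scale.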
