Asymptotic efficiency of restart and checkpointing
Antonio Sodre

TL;DR
This paper analyzes the asymptotic efficiency of restart and checkpointing strategies for infinite sequences of tasks with failures, providing conditions for positive efficiency and identifying universal checkpoints in exponential failure models.
Contribution
It introduces a framework for evaluating asymptotic efficiency of restart and checkpointing, accounting for task size distributions, failure rates, and dependencies, and proves the existence of universal checkpoints in exponential failure scenarios.
Findings
Whether asymptotic efficiency is positive depends on how the tail of the task-size distribution compares with the tail of the failure distribution.
Conditions for positive asymptotic efficiency are established.
Universal checkpoints exist in exponential failure models.
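The tail comparison can be illustrated with a simple special case. This is a sketch worked out under assumptions that are not taken from the paper: failures form a Poisson process with rate $\lambda$, task sizes $T$ are i.i.d. with finite positive mean and independent of the failures, and $X$ denotes the completion time of one task under restart.

```latex
\mathbb{E}[X \mid T = t] = \frac{e^{\lambda t} - 1}{\lambda},
\qquad
\lim_{n \to \infty} \frac{\sum_{k=1}^{n} T_k}{\sum_{k=1}^{n} X_k}
  = \frac{\mathbb{E}[T]}{\mathbb{E}[X]}
  = \frac{\lambda\, \mathbb{E}[T]}{\mathbb{E}\!\left[e^{\lambda T}\right] - 1}.
```

Under these assumptions the limit is positive exactly when $\mathbb{E}[e^{\lambda T}] < \infty$, that is, when the task-size tail is light enough relative to the exponential failure tail. The paper's framework is more general (stationary task sizes, varying failure rates), so this should be read only as a motivating example.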
Abstract
Many tasks are subject to failure before completion. Two of the most common failure recovery strategies are restart and checkpointing. Under restart, once a failure occurs, the task is restarted from the beginning. Under checkpointing, the task is resumed from the preceding checkpoint after the failure. We study asymptotic efficiency of restart for an infinite sequence of tasks, whose sizes form a stationary sequence. We define asymptotic efficiency as the limit of the ratio of the total time to completion in the absence of failures over the total time to completion when failures take place. Whether the asymptotic efficiency is positive or not depends on the comparison of the tail of the distributions of the task size and the random variables governing failures. Our framework allows for variations in the failure rates and dependencies between task sizes. We also study a similar notion of…
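The two recovery strategies and the efficiency ratio described in the abstract can be explored with a small Monte Carlo simulation. The sketch below is not the paper's method. It assumes exponentially distributed times to failure with a constant rate, evenly spaced checkpoints, and zero cost for taking a checkpoint. The function names `completion_time` and `empirical_efficiency` are hypothetical.

```python
import random


def completion_time(size, failure_rate, rng, checkpoint_interval=None):
    """Simulated time to finish one task of the given size.

    Assumption: each attempt draws a fresh exponential time to failure.
    With checkpoint_interval=None the strategy is plain restart.
    """
    # Under restart the whole task is one segment; under checkpointing,
    # progress is saved after every `checkpoint_interval` units of work.
    segment = size if checkpoint_interval is None else checkpoint_interval
    elapsed = 0.0
    remaining = size
    while remaining > 0:
        work = min(segment, remaining)
        time_to_failure = rng.expovariate(failure_rate)
        if time_to_failure >= work:
            elapsed += work          # segment finished before any failure
            remaining -= work
        else:
            elapsed += time_to_failure  # work lost; retry the same segment
    return elapsed


def empirical_efficiency(sizes, failure_rate, checkpoint_interval=None, seed=0):
    """Ratio of failure-free total time to simulated total time with failures."""
    rng = random.Random(seed)
    ideal = sum(sizes)
    actual = sum(
        completion_time(s, failure_rate, rng, checkpoint_interval) for s in sizes
    )
    return ideal / actual


# Example: unit-size tasks, failure rate 1 (illustrative values only).
sizes = [1.0] * 20000
restart_eff = empirical_efficiency(sizes, failure_rate=1.0)
checkpoint_eff = empirical_efficiency(sizes, failure_rate=1.0, checkpoint_interval=0.25)
print(f"restart: {restart_eff:.3f}, checkpointing: {checkpoint_eff:.3f}")
```

For a finite run this ratio only approximates the limit defined in the abstract. With unit-size tasks and failure rate 1, the restart estimate should land near 1/(e - 1) ≈ 0.58, and free checkpoints should give a higher value.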
Taxonomy
Topics: Distributed systems and fault tolerance · Advanced Queuing Theory Analysis · Optimization and Search Problems
