Online Checkpointing with Improved Worst-Case Guarantees

Karl Bringmann; Benjamin Doerr; Adrian Neumann; Jakub Sliacan

arXiv:1302.4216 · cs.DS · May 1, 2013

TL;DR

This paper improves the worst-case guarantees for online checkpointing. It presents checkpoint placement strategies whose discrepancy is asymptotically below the previously known upper bound of 2, and it raises the lower bound from 1 + 1/k to 1.30 - o(1).

Contribution

It introduces new algorithms whose asymptotic discrepancy improves on the previous upper bound of 2, and it proves that optimal algorithms exist for all k.

Findings

01

Discrepancy q_k <= 1.59 + o(1) for all k.

02

Discrepancy q_k <= 1.39 + o(1) for k being a power of two.

03

Lower bound q_k >= 1.30 - o(1).

Abstract

In the online checkpointing problem, the task is to continuously maintain a set of k checkpoints that allow an ongoing computation to be rewound faster than by a full restart. The only operation allowed is to replace an old checkpoint with the current state. Our aim is to find checkpoint placement strategies that minimize rewinding cost, i.e., such that at all times T, when requested to rewind to some time t <= T, the number of computation steps that need to be redone to get to t from a checkpoint before t is as small as possible. In particular, we want the closest checkpoint earlier than t to be no further from t than q_k times the ideal distance T / (k+1), where q_k is a small constant. Improving over earlier work showing 1 + 1/k <= q_k <= 2, we show that q_k can be chosen asymptotically less than 2. We present algorithms with asymptotic discrepancy q_k <= 1.59 + o(1) valid for all k and…
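
To make the quantity bounded by q_k concrete, the following is a minimal sketch of how the discrepancy of a single checkpoint set at time T could be computed. It is an illustration, not the paper's algorithm. It assumes that the start of the computation (time 0) always serves as a free restart point and that the current time T closes the last interval; the function name `discrepancy` is hypothetical.

```python
from fractions import Fraction


def discrepancy(checkpoints, T):
    """Ratio of the longest rewind gap to the ideal gap T / (k + 1).

    Assumption (not stated in the abstract): time 0 is always available
    as a restart point, and the current time T ends the last interval.
    """
    k = len(checkpoints)
    # Interval endpoints: start of computation, the k checkpoints, current time.
    points = [0] + sorted(checkpoints) + [T]
    # The worst-case rewind cost is the longest gap between neighbouring points.
    longest_gap = max(b - a for a, b in zip(points, points[1:]))
    return Fraction(longest_gap * (k + 1), T)


# Evenly spaced checkpoints reach the ideal ratio of 1.
print(discrepancy([25, 50, 75], 100))  # 1
# Uneven spacing: longest gap is 40, ideal gap is 25, so the ratio is 8/5.
print(discrepancy([10, 20, 60], 100))  # 8/5
```

An online strategy has to keep this ratio at most q_k at every time T while only ever replacing an old checkpoint with the current state, which is why even spacing cannot be maintained continuously.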

Taxonomy

Topics: Distributed systems and fault tolerance · Optimization and Search Problems · Advanced Data Storage Technologies