Optimal Checkpoint Interval with Availability as an Objective Function

Nirmal Raj Saxena; Saurabh Hukerikar; Mikolaj Blaz; Swapna Raj

arXiv:2410.18124 · cs.DC · October 25, 2024

Open Access

TL;DR

This paper gives a simplified derivation of the lost-time-optimal checkpoint interval and derives a second interval that maximizes system availability. The two intervals differ in general but converge asymptotically under certain conditions on checkpoint save and recovery time.

Contribution

It introduces an availability-based optimal checkpoint interval derivation, contrasting with the traditional lost-time approach, and analyzes its behavior under different error detection latency scenarios.

Findings

1. Availability-optimal interval differs from lost-time-optimal in general.

2. For small error detection latency, both optimal intervals are asymptotically similar.

3. Large error detection latency leads to larger optimal checkpoint intervals.

Abstract

We present a simplified derivation of the optimal checkpoint interval in Young (1974) [1]. The derivation in [1] minimizes the total lost time as its objective function. Lost time is a function of checkpoint interval, checkpoint save time, and average failure time. Our simplified derivation yields a lost-time-optimal interval that is identical to the one derived in [1]. For large scale-out supercomputer or datacenter systems, what matters is selecting the checkpoint interval that maximizes availability. We show that the availability-optimal checkpoint interval is different from the one derived in [1]. However, the availability-optimal checkpoint interval is asymptotically the same as the lost-time-optimal checkpoint interval under certain conditions on checkpoint save and recovery time. We show that these optimal checkpoint intervals hold in situations where…
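
As a rough illustration of the lost-time objective the abstract describes, the sketch below uses the widely quoted first-order form of Young's result: the lost-time fraction is approximated as checkpoint overhead plus expected rework, and its minimizer is the square root of twice the save time multiplied by the mean time between failures. This is not the paper's availability-based derivation, which the abstract does not spell out. The function names and sample numbers are hypothetical, and the approximation assumes the interval is much smaller than the mean time between failures.

```python
import math


def lost_time_fraction(interval: float, save_time: float, mtbf: float) -> float:
    """First-order lost time per unit of run time (illustrative approximation).

    save_time / interval      : overhead of writing one checkpoint per interval
    interval / (2 * mtbf)     : expected rework, assuming a failure lands on
                                average halfway through an interval
    """
    return save_time / interval + interval / (2.0 * mtbf)


def young_interval(save_time: float, mtbf: float) -> float:
    """Interval that minimizes lost_time_fraction: sqrt(2 * save_time * mtbf)."""
    return math.sqrt(2.0 * save_time * mtbf)


if __name__ == "__main__":
    # Hypothetical numbers: 2-minute checkpoint save, 100-minute mean time between failures.
    save_time, mtbf = 2.0, 100.0
    best = young_interval(save_time, mtbf)
    for t in (best / 2, best, best * 2):
        print(f"interval={t:6.1f}  lost fraction={lost_time_fraction(t, save_time, mtbf):.3f}")
```

Evaluating the objective at half and double the computed interval shows the familiar trade-off: shorter intervals pay more checkpoint overhead, longer ones pay more rework after a failure.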

Taxonomy

Topics: Advanced Control Systems Optimization · Optimization and Search Problems · Simulation Techniques and Applications