Tight Bounds on Online Checkpointing Algorithms
Achiya Bar-On, Itai Dinur, Orr Dunkelman, Rani Hod, Nathan Keller, Eyal Ronen, Adi Shamir

TL;DR
This paper establishes tight bounds on the discrepancy of online checkpointing algorithms, solving open problems for all large and small values of k, and introduces new applications of the problem.
Contribution
It proves that ln(4) is the exact asymptotic bound for large k and provides optimal algorithms for small k, advancing understanding of online checkpointing.
Findings
Discrepancy asymptotically equals ln(4) for large k.
Optimal algorithms are identified for all k ≤ 10.
New applications of online checkpointing are described.
Abstract
The problem of online checkpointing is a classical problem with numerous applications which has been studied in various forms for almost 50 years. In the simplest version of this problem, a user has to maintain k memorized checkpoints during a long computation, where the only allowed operation is to move one of the checkpoints from its old time to the current time, and his goal is to keep the checkpoints as evenly spread out as possible at all times. Bringmann et al. studied this problem as a special case of an online/offline optimization problem in which the deviation from uniformity is measured by the natural discrepancy metric of the worst case ratio between real and ideal segment lengths. They showed this discrepancy is smaller than 1.59 - o(1) for all k, and smaller than ln 4 - o(1) ≈ 1.39 for the sparse subset of k's which are powers of 2. In addition, they obtained…
| k | Efficiency | γ | Pattern B | n | | |
|---|---|---|---|---|---|---|
| 2* | 1.5 | 2 | (1) | 1 | 0 | 0 |
| 3* | 1.5278641 | 1.618037 | (1) | 1 | 2 | 1 |
| 4 | 1.5398927 | 1.8019377 | (1,3) | 2 | 7 | 3 |
| 5* | 1.4707341 | 1.7548777 | (1,3) | 2 | 36 | 13 |
| 6 | 1.5127400 | 3.627365 | (1,2,3,1,3,5) | 6 | 117 | 9 |
| 7 | 1.4974818 | 3.11201 | (1,3,4,1,5,3) | 6 | 559 | 10 |
| 8 | 1.4851548 | 10.712656 | (1,2,4,7,5,3,1,7,5,3,7,1,4,2,4,5) | 16 | 1698 | 14 |
| 9 | 1.4730721 | 3.2748095 | (1,5,3,5,1,5,6,3) | 8 | 5892 | 135 |
| 10 | 1.4678452 | 5.67943 | (1,5,3,5,1,5,6,3,1,5,9,3,5,9) | 14 | 32843 | 20 |
| 11 | 1.4650841 | 8.190656 | (1,3,5,6,1,6,2,10,6,3,6,1,6,2,6,3,9,6) | 18 | | |
| 12 | 1.4668421 | 8.862576 | (1,2,3,5,6,7,1,2,6,3,6,7,1,2,6,3,6,9,7) | 19 | | |
| 13 | 1.4592320 | 2.94 | (1,3,6,7,4,7,1,7,8,3) | 10 | | |
| 14 | 1.4570046 | 58.6 | (1,4,2,6,7,4,7,8,1,8,2,3,7,12,4,7,8,1,4,7,2,7,8,4,13,8,1,8,4,2,7,4,7,8,1,8,4,2,7,12,4,7,13,8) | 44 | | |
| 15 | 1.4487459 | 2.104027 | (1,2,7,8,4,8,9,5) | 8 | | |
| 16 | 1.4487597 | 8.46 | (1,2,4,7,8,9,5,9,1,2,8,4,8,9,5,9,1,2,8,4,8,13,9,5,9) | 25 | | |
| 17 | 1.4593611 | 1.694884 | (1,9,5,3,14,8,9) | 7 | | |
| 18 | 1.4575670 | 2.57 | (1,8,9,5,9,10,2,5,9,10,3,5) | 12 | | |
| 19 | 1.4592194 | 2.45 | (1,9,5,9,10,2,5,9,10,11,3,5) | 12 | | |
| 20 | 1.4696048 | 13.3 | (1,5,9,10,2,5,9,10,11,3,5,10,1,5,9,10,11,2,5,10,11,3,5,10,1,5,9,10,6,2,9,10,11,3,5,10) | 36 | | |
| Efficiency | is smallest root of | % | |||
|---|---|---|---|---|---|
| 2 | 0 | 1 | 2 | ||
| 4 | 1 | 1.527864045 | 2.61803399 | 38.63% | |
| 8 | 2 | 1.446619893 | 2.220744085 | 24.69% | |
| 16 | 3 | 1.414522345 | 2.096981559 | 15.40% | |
| 32 | 4 | 1.399982156 | 2.045751025 | 9.337% | |
| 64 | 5 | 1.393037798 | 2.022250526 | 5.520% | |
| 128 | 6 | 1.389641669 | 2.010975735 | 3.196% | |
| 256 | 7 | 1.389039657 | 2.006538067 | 2.996% | |
| 512 | 8 | 1.38976776 | 2.005370202 | 4.265% | |
| 1024 | 9 | 1.389961428 | 2.004616597 | 5.003% | |
| 2048 | 10 | 1.389901672 | 2.004083324 | 5.413% | |
| 4096 | 11 | 1.3897339 | 2.003678733 | 5.631% | |
| 8192 | 12 | 1.389529892 | 2.003356204 | 5.738% | |
| 16384 | 13 | 1.389323191 | 2.003090123 | 5.785% | |
| 32768 | 14 | 1.389128152 | 2.002865287 | 5.799% | |
| 65536 | 15 | 1.388949844 | 2.002671985 | 5.796% | |
| 131072 | 16 | 1.388789052 | 2.002503614 | 5.786% | |
| 262144 | 17 | 1.388644741 | 2.002355444 | 5.772% |
| efficiency | is smallest root of | % | |||
|---|---|---|---|---|---|
| 3 | 0 | 1.145898034 | 2.058171027 | ||
| 7 | 1 | 1.318433761 | 2.075892596 | ||
| 15 | 2 | 1.408092224 | 2.09450451 | 11.62% | |
| 31 | 3 | 1.448810165 | 2.099878619 | 42.25% | |
| 63 | 4 | 1.46399549 | 2.097270918 | 63.36% | |
| 127 | 5 | 1.466865403 | 2.091122952 | 76.82% | |
| 255 | 6 | 1.464278319 | 2.083917032 | 85.05% | |
| 511 | 7 | 1.459602376 | 2.076835761 | 89.98% | |
| 1023 | 8 | 1.454408143 | 2.070357915 | 92.91% | |
| 2047 | 9 | 1.449377613 | 2.064618545 | 94.66% | |
| 4095 | 10 | 1.444769023 | 2.059600382 | 95.72% | |
| 8191 | 11 | 1.440647199 | 2.055228333 | 96.39% | |
| 16383 | 12 | 1.436994729 | 2.051413108 | 96.83% | |
| 32767 | 13 | 1.43376402 | 2.048069607 | 97.14% | |
| 65535 | 14 | 1.430900622 | 2.045123383 | 97.36% | |
| 131071 | 15 | 1.428352881 | 2.042511814 | 97.54% | |
| 262143 | 16 | 1.426075306 | 2.040183169 | 97.69% |
Tight Bounds on Online Checkpointing Algorithms
Achiya Bar-On (Department of Mathematics, Bar-Ilan University, Ramat-Gan, Israel)
Itai Dinur (Computer Science Department, Ben-Gurion University, Beer-Sheba, Israel)
Orr Dunkelman (Computer Science Department, University of Haifa, Haifa, Israel)
Rani Hod (Department of Mathematics, Bar-Ilan University, Ramat-Gan, Israel)
Nathan Keller (Department of Mathematics, Bar-Ilan University, Ramat-Gan, Israel)
Eyal Ronen (Computer Science Department, The Weizmann Institute, Rehovot, Israel)
Adi Shamir (Computer Science Department, The Weizmann Institute, Rehovot, Israel)
Abstract.
The problem of online checkpointing is a classical problem with numerous applications which has been studied in various forms for almost 50 years. In the simplest version of this problem, a user has to maintain $k$ memorized checkpoints during a long computation, where the only allowed operation is to move one of the checkpoints from its old time to the current time, and his goal is to keep the checkpoints as evenly spread out as possible at all times.
Bringmann et al. studied this problem as a special case of an online/offline optimization problem in which the deviation from uniformity is measured by the natural discrepancy metric of the worst case ratio between real and ideal segment lengths. They showed this discrepancy is smaller than $1.59 - o(1)$ for all $k$, and smaller than $\ln 4 - o(1) \approx 1.39$ for the sparse subset of $k$'s which are powers of 2. In addition, they obtained upper bounds on the achievable discrepancy for some small values of $k$.
In this paper we solve the main problems left open in the above-mentioned paper by proving that $\ln 4$ is a tight upper and lower bound on the asymptotic discrepancy for all large $k$, and by providing tight upper and lower bounds (in the form of provably optimal checkpointing algorithms, some of which are in fact better than those of Bringmann et al.) for all the small values of $k \le 10$.
In the last part of the paper we describe some new applications of this online checkpointing problem.
Keywords: Checkpoints, Online Algorithms, Competitive Analysis
CCS: Theory of computation → Approximation algorithms analysis
1. Introduction and Notation
Most programs perform some irreversible operations, and thus they can only be run in a forward direction. However, in many cases we would like to roll back a computation to an earlier point in time. When the computation is short, we can just rerun the computation from the beginning, but when the computation requires many days, a better strategy is to memorize several copies of the full state of the computation at various times. These memorized states (called checkpoints) make it possible to roll the computation back from time $T$ to any earlier time $T'$ by restarting the computation from the last available checkpoint which was memorized before $T'$. This checkpointing technique is extremely useful in many real life applications: for example, when we want to interactively debug a new program we may want to randomly access earlier points in the execution in order to find the source of a problem; in fault tolerant computer systems we may want to undo the effects of faulty hardware; and during lengthy simulations of physical systems we may want to explore the effect of changing some parameter, such as the temperature, at some earlier point in time without rerunning the simulation from the beginning.
In principle, we can try to memorize the full state of the computation after each step, but for long computations this requires an unrealistic amount of memory. Instead, we assume that we have some bounded amount of memory which suffices to keep $k$ checkpoints. At time $T$, these checkpoints are spread within the time interval $[0, T]$, dividing it into $k+1$ subintervals between consecutive checkpoints (where the endpoints $0$ and $T$ can be viewed as virtual checkpoints which require no additional memory). As $T$ increases, the last subinterval gets longer, and at some point we may want to relocate one of the old checkpoints by reusing its memory to store the current state of the computation. A checkpointing algorithm can thus be viewed as an infinite pebbling game in which we place $k$ pebbles on the positive side of the time axis, and then repeatedly perform update operations which move one of the pebbles to the right of all the other pebbles.
The first paper dealing with this problem seems to be "Rollback and Recovery Strategies for Computer Programs" (Chandy and Ramamoorthy, 1972), published in 1972, while the first paper which tried to solve it optimally was "On the Optimum Checkpoint Interval" (Gelenbe, 1979), published in 1979. Over the years, dozens of academic research papers were published in this area, most notably (Toueg and Babaoglu, 1984) in 1984, (Bern et al., 1994) in 1994, and (Ahlroth et al., 2013; Bringmann et al., 2013) in 2013. However, many of these papers either dealt with concrete applications of the problem in other areas (especially in distributed computing where the notion of a timeline is different), or used other optimization criteria (which make their optimal solutions incomparable with ours). The mathematical problem we are dealing with in this paper was mentioned in (Ahlroth et al., 2013) and studied in (Bringmann et al., 2013), and we closely follow their model and notation.
At any time , we define a snapshot as the ordered sequence of current checkpoint locations . Within each snapshot, we refer to the checkpoints by their freshness index , where checkpoint stores the oldest state and checkpoint stores the newest state. Starting from an initial snapshot , we define for every the -th update action as a pair in which is the freshness index of the checkpoint whose memory we want to reuse by moving it to time . A typical example of how one snapshot is transformed into another snapshot by an update operation is described in Fig. 1. The effect of the -th update action is to unify the two consecutive subintervals which were separated by the -th oldest active checkpoint at time , and to create a new subinterval which ends at . Note that with this notation, each update action affects multiple freshness indices within the snapshot; in particular, the freshness index of active checkpoint for is left unchanged, and is decreased by one for . To demonstrate this point, consider a sequence of updates in which for all : it updates the memory locations in a round robin way since it always updates the oldest active checkpoint by overwriting it with the newest checkpoint, shifting all freshness indices by one. On the other hand, a sequence of updates in which for all keeps updating the same memory location, pushing its associated checkpoint further and further to a later time, with no change to the other checkpoints.
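The effect of a single update action on freshness indices can be sketched in a few lines of Python. The function name `apply_update` and the representation of a snapshot as a sorted list of times are assumptions made for this illustration, not notation from the paper; freshness index 1 denotes the oldest checkpoint.

```python
def apply_update(snapshot, b, t):
    """Return the snapshot obtained by moving the checkpoint with
    freshness index b (1 = oldest) to the current time t.

    `snapshot` is a sorted list of checkpoint times, all smaller than t.
    """
    if not 1 <= b <= len(snapshot):
        raise ValueError("freshness index out of range")
    if t <= snapshot[-1]:
        raise ValueError("new checkpoint must lie to the right of all others")
    # Dropping entry b merges its two neighbouring subintervals; indices
    # below b are unchanged and indices above b shift down by one.
    return snapshot[:b - 1] + snapshot[b:] + [t]


start = [1, 2, 3, 4]
round_robin = apply_update(start, 1, 5)  # overwrite the oldest checkpoint
same_slot = apply_update(start, 4, 5)    # overwrite the newest checkpoint
print(round_robin, same_slot)
```

Overwriting the oldest checkpoint shifts every remaining index down by one, while overwriting the newest one leaves all other checkpoints in place, matching the two extreme update sequences described above.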
In this model, the time complexity of rolling back a computation from time $T$ to time $T' < T$ is assumed to be proportional to the distance between $T'$ and the last checkpoint that precedes $T'$ in the snapshot at time $T$, and thus its worst case happens when we decide to roll back to just before the end of the longest subinterval. A checkpointing algorithm consists of a monotonically increasing and unbounded sequence of update times and a pattern sequence, forming an initial snapshot and an infinite sequence of update actions; its goal is to make the length of this longest subinterval as short as possible. Clearly, no checkpointing algorithm can make this length shorter than the subinterval length in a perfectly uniform partition of $[0, T]$, which is $T/(k+1)$. We say that a snapshot of a $k$-checkpoint algorithm is $q$-compliant at time $T$ if the $k+1$ subintervals it defines satisfy $t_i - t_{i-1} \le qT/(k+1)$ for $i = 1, \ldots, k+1$ (footnote 1: we write $t_0 = 0$ and $t_{k+1} = T$ for convenience), and that Alg is $q$-efficient if its snapshots are $q$-compliant at all times $T$. Finally, the efficiency of a checkpointing algorithm is defined as the smallest $q$ for which it is $q$-efficient.
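As a small illustration of the compliance definition, the following sketch computes the ratio between the longest subinterval and the ideal length, reading the ideal length as $T/(k+1)$. The name `compliance_ratio` is a label chosen here, not one from the paper; a snapshot is compliant for a given parameter exactly when this ratio does not exceed it.

```python
def compliance_ratio(snapshot, t):
    """Longest subinterval of [0, t], divided by the ideal length t / (k + 1).

    `snapshot` is a sorted list of k checkpoint times in (0, t]; the
    endpoints 0 and t act as virtual checkpoints.
    """
    k = len(snapshot)
    points = [0] + list(snapshot) + [t]
    longest = max(right - left for left, right in zip(points, points[1:]))
    return longest * (k + 1) / t


# Evenly spread checkpoints reach the ideal ratio of 1.
print(compliance_ratio([1, 2, 3], 4))
# Right after overwriting the oldest checkpoint the first gap has doubled.
print(compliance_ratio([2, 3, 4], 4))
```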
Notice that the problem of efficient checkpointing can be viewed as a special case of an online/offline optimization problem: If we knew in advance the time $T$ at which we would like to roll back the computation, we could make each subinterval as small as $T/(k+1)$. However, in the online version of the problem, we do not know $T$ in advance, and thus we have to position the checkpoints so that they will be roughly equally spaced at all times. The efficiency of the solution is the ratio between what we can achieve in the online and offline cases, respectively, and the goal of the online checkpointing problem is to find the smallest possible efficiency achievable by the best $k$-checkpoint algorithm for any given $k$.
Clearly, the optimal efficiency is at least $1$ for all $k$, and it cannot be too close to $1$ since any snapshot in which all subintervals have roughly the same length will be transformed by the next update operation to a snapshot in which one of the subintervals will be the union of two previous subintervals, and thus will be about twice as long as the other subintervals (footnote 2: actually it is at least $1 + 1/k$, since subinterval $k+1$ has zero length upon updating, as noted in (Ahlroth et al., 2013, Theorem 3)). On the other hand, there is a very simple subinterval doubling algorithm from (Ahlroth et al., 2013, Section 3.1) which is 2-efficient: Assuming WLOG that $k$ is even, the algorithm starts with the snapshot $(1, 2, \ldots, k)$, and performs the sequence of update actions $(k+2, 1), (k+4, 2), \ldots, (2k, k/2)$, each written as (update time, freshness index), yielding the snapshot $(2, 4, \ldots, 2k)$. Since this snapshot is the same as the original snapshot up to a scaling factor of 2, we can continue with update actions $(2k+4, 1), (2k+8, 2), \ldots$ and so on. This is a cyclic algorithm, repeating the same sequence of freshness indices again and again but with times which form a geometric progression. As in each snapshot there are only two possible lengths for the subintervals, of the form $2^j$ and $2^{j+1}$, all the snapshots in this algorithm are 2-compliant, and thus the algorithm is 2-efficient.
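The doubling scheme can be checked numerically with a short simulation. This is a sketch under an assumed concrete reading of the algorithm: start from checkpoints at $1, 2, \ldots, k$ and, within each cycle, overwrite the checkpoint with freshness index $j$ at the $j$-th update, with update times two units of the current scale apart. The function names are hypothetical, and compliance is only checked at update times, which Section 2 notes is sufficient.

```python
from fractions import Fraction


def compliance_ratio(snapshot, t):
    """Longest subinterval of [0, t] relative to the ideal length t / (k + 1)."""
    k = len(snapshot)
    points = [0] + list(snapshot) + [t]
    longest = max(right - left for left, right in zip(points, points[1:]))
    return Fraction(longest * (k + 1), t)


def doubling_worst_ratio(k, cycles=5):
    """Run the doubling scheme for a few cycles; return the worst ratio seen."""
    if k % 2:
        raise ValueError("this sketch assumes an even k")
    snapshot = list(range(1, k + 1))
    scale, worst = 1, Fraction(0)
    for _ in range(cycles):
        for j in range(1, k // 2 + 1):
            t = scale * (k + 2 * j)
            worst = max(worst, compliance_ratio(snapshot, t))  # just before
            snapshot = snapshot[:j - 1] + snapshot[j:] + [t]   # drop index j
            worst = max(worst, compliance_ratio(snapshot, t))  # just after
        scale *= 2
        # Each cycle reproduces the initial snapshot, scaled by 2.
        assert snapshot == [scale * i for i in range(1, k + 1)]
    return worst


print(doubling_worst_ratio(4))
```

Exact rational arithmetic avoids any rounding doubt when comparing the worst ratio against the bound of 2.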
The best strategy for keeping the checkpoints as uniform as possible at all times is thus to keep in each snapshot a variety of subinterval lengths, so that the algorithm will always be able to join two relatively short adjacent subintervals into a single subinterval which is not too long. This can be viewed as a generalization of the algorithm that creates Fibonacci numbers: Whereas the standard algorithm is always adding the last two numbers and placing their sum on the right, in our case we can add any two consecutive numbers in the sequence, replacing them by their sum and adding any number we want on the right. Analyzing this problem is surprisingly difficult, and so far there had been no tight bounds on the best possible efficiencies of online checkpointing algorithms in this model.
The main results in (Bringmann et al., 2013) are two online checkpointing algorithms whose asymptotic efficiencies are $\ln 4 \approx 1.39$ for the sparse subset of $k$'s which are powers of 2, and $1.59$ for general $k$. In addition, they proved in their model the first nontrivial asymptotic lower bound of $2 - \ln 2 \approx 1.30$. However, since the upper and lower bounds did not match, it was not clear whether the checkpointing algorithms they proposed were asymptotically optimal. For small values of $k$ they presented concrete checkpointing algorithms whose efficiencies were all below $1.55$, but again it was not clear whether they were optimal.
In this paper we solve the main open problems related to the mathematical formulation of the problem which was defined and studied in (Bringmann et al., 2013). In particular, we develop a new checkpointing algorithm with an asymptotic efficiency of $\ln 4$ for all values of $k$, and prove its optimality by providing a matching asymptotic lower bound. For all the small values of $k \le 10$ we develop optimal checkpointing algorithms by proving tight upper and lower bounds on the achievable efficiency for these $k$'s. This analysis enables us to show that for some values of $k$, the algorithms presented in (Bringmann et al., 2013) are in fact suboptimal.
The rest of this paper is organized as follows. In Section 2 we go over basic observations about checkpointing algorithms (some from (Bringmann et al., 2013), some new). In Section 3 we focus on moderately small values of $k$ and provide optimal algorithms for $k \le 10$. In Section 4 we construct a recursive algorithm of asymptotically optimal efficiency $\ln 4$. In Section 5 we prove a matching asymptotic lower bound of $\ln 4$. In Section 6 we present two new applications of online checkpointing. In Section 7 we provide concluding remarks.
2. Basic Observations
By definition, a $k$-checkpoint algorithm is $q$-efficient if and only if its snapshots at all times are $q$-compliant. However, as noted in (Ahlroth et al., 2013, Lemma 2) (and also (Bringmann et al., 2013, Lemma 1)), it suffices to verify compliance only at the discrete update times. It thus makes sense to only consider "standard" snapshots taken at the update times. Moreover, as shown in (Bringmann et al., 2013, Lemma 2), besides compliance of the initial snapshot, it suffices to verify compliance of just two subintervals after every update: the subinterval that ends in the new checkpoint, and the subinterval created by merging two consecutive subintervals.
The following two observations about the sequence were mentioned in (Bringmann et al., 2013, Section 6) without proof.
Property 1.
Without loss of generality we can assume a -checkpoint algorithm updates the least recent checkpoint infinitely often (i.e., ).
Proof.
Fix a -efficient algorithm and consider its standard snapshots for . For notational convenience let for and . The update times sequence is unbounded and , so there must exist some minimal for which the sequence is unbounded. If we are done; otherwise we show how to modify the algorithm, while maintaining -efficiency, such that would be unbounded as well.
Let and pick such that . No further update action can update a checkpoint of freshness index since that would result in . In particular, we have for all . Pick such and . It is possible since is unbounded and cannot decrease unless checkpoint is updated. We modify the algorithm to update checkpoint instead of at time , that is, set . Instead of creating a -compliant subinterval of length
[TABLE]
this update action now creates a subinterval of length
[TABLE]
which is still -compliant. The algorithm remains -efficient since future updates only touch subintervals and did not grow. This process can be repeated as long as remains finite. Note that in the limiting algorithm, checkpoint is updated infinitely often so cannot be finite; thus must be unbounded. ∎
Remark 1.
An important consequence of Property 1 is that we can essentially ignore the compliance of the initial snapshot by rebasing, i.e., running the algorithm until all checkpoints present in are overwritten and treating the then-current snapshot as the new initial .
Property 2.
Without loss of generality we can assume a -checkpoint algorithm never updates the most recent checkpoint (i.e., for all ).
Proof.
Fix a -efficient algorithm . If then the -th update transforms the snapshot to . We have since subinterval in is -compliant, so for any we have . Hence we can simply skip the -th update action altogether and the algorithm remains -efficient. ∎
Remark.
Property 2 means that the two last checkpoints in snapshot are and , and thus subinterval is -compliant if and only if , that is, , where . We refer to this condition by saying that the update times sequence should be -subgeometric.
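The equivalence behind the subgeometric condition can be written out as a short worked derivation. The symbols below are labels chosen here because the extraction lost the originals: $T_i$ and $T_{i+1}$ are consecutive update times, $q$ is the compliance parameter, and $\lambda$ is the resulting growth bound. Just before the update at $T_{i+1}$ the last subinterval is $[T_i, T_{i+1}]$, so, assuming $q < k+1$:

```latex
T_{i+1} - T_i \le \frac{q\,T_{i+1}}{k+1}
\iff T_{i+1}\Bigl(1 - \frac{q}{k+1}\Bigr) \le T_i
\iff T_{i+1} \le \lambda\,T_i,
\qquad \lambda = \frac{k+1}{k+1-q}.
```

As a sanity check against Table 1, $k = 2$ with efficiency $1.5$ gives $\lambda = 3/1.5 = 2$, and $k = 3$ with efficiency $1.5278641$ gives $\lambda \approx 1.618034$, which agree with the tabulated values of $\gamma$ for these two geometric algorithms.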
Next we introduce the notion of cyclic algorithms. Upper bounds on presented in this paper, as well as in (Ahlroth et al., 2013; Bringmann et al., 2013), are all achieved by cyclic algorithms. Given a positive integer and a real number , a -checkpoint algorithm is -cyclic if for all and for all . It has been observed in (Bringmann et al., 2013, Lemma 5) that any -efficient -cyclic algorithm must satisfy (to see this, apply subgeometry times). An -cyclic algorithm is called -geometric when (and thus for ).
We finish this section with two observations about the exponential growth of update times in efficient algorithms, relevant for upper and lower bounds on .
The first one is an improvement of (Bringmann et al., 2013, Lemma 8):
Property 3.
Any -efficient -checkpoint algorithm satisfies, without loss of generality, for all .
Proof.
Starting with a -efficient algorithm, we consider checkpoint updates in their natural order and modify the algorithm, while maintaining efficiency, such that the property holds. At each step, the only modifications we make are to skip or delay an update, which ensures that is still unbounded. The basic simple idea in all modifications is that if an update that merges two subintervals is possible at some time , i.e., the newly created subinterval is -compliant, then an update that merges the same subintervals is also possible at any time .
If the property does not hold, consider the smallest for which . We show we can either skip one of the updates at times , or delay the update at time until time . For , denote by the next time at which the checkpoint at is updated, and by the label of the actual checkpoint updated (footnote 3: in contrast to a temporary index of the checkpoint in the checkpoint sequence at some snapshot , a label of a checkpoint is fixed). Note that by Property 2, and that the order of and is undetermined.
If (the checkpoint at is in place by the time the checkpoint at is updated), then eliminate the checkpoint update at (keeping the sequence -subgeometric as ) and switch roles between the two checkpoints throughout the rest of the algorithm. Namely, update from its previous update (before being updated at ) directly at time (footnote 4: if checkpoint was first used at , then simply use first at in the modified algorithm), and update from its previous update directly at time (and then again at and so forth).
Otherwise, ; consider the time interval and denote by the checkpoint that is removed from it at the latest time before (note that there is at least one such checkpoint, namely , hence it may be that ). Delay the update of to time and skip all other updates in that are removed from this time interval before . In other words, for each checkpoint (except ) updated in the time interval and updated again in the time interval , update it directly from its previous update (before ) to its next one (after ). ∎
An immediate corollary of Property 3 is that for ; in particular, for . Our next observation says when we can get .
Property 4.
Let be a snapshot of some -efficient -checkpoint algorithm such that and for some . Then, without loss of generality, .
Proof.
Starting with a -efficient algorithm that satisfies Property 3, we modify it such that Property 4 holds as well. As in the previous proof, for , denote by the label of the checkpoint at and denote by the time by which it is removed from the time interval .
We show that if then we can delay the update at time until and eliminate the update at time . Note that by Property 3, hence after this change Property 3 continues to hold.
If we switch the roles of the two checkpoints by updating directly at and at and again at ; otherwise, and we simply delay the update of to time and skip the update at by updating directly at from its previous update. ∎
3. Optimal Algorithms for Small Values of $k$
3.1. Round-robin and
We now analyze the efficiency of the Round-Robin algorithm, which is geometric and always updates the oldest checkpoint (i.e., $b_i = 1$ for all $i$). (Footnote 5: The case of Round-Robin was considered in (Bringmann et al., 2013, Theorem 1) under the name Simple.) Besides serving as a first example, Round-Robin is optimal for $k = 2, 3$ and will make an appearance within the asymptotically optimal algorithm Recursive of Section 4.
Proposition 3.1.
The efficiency of -checkpoint Round-Robin is , where is the smallest real root of .
Proof.
Denote by the sequence of update times. A snapshot of Round-Robin is . For the algorithm to be -efficient for some , update times need to satisfy as well as the subgeometric conditions implying that . This is possible if and only if . Moreover, for any , the efficiency of Round-Robin using a geometric update times sequence is . This is indeed minimized by choosing . ∎
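The trade-off in this proof can be explored numerically. The sketch below assumes update times of the form $\lambda^i$ and uses two closed-form constraints derived here from the definitions rather than copied from the paper: the newest gap just before an update contributes $(k+1)(1 - 1/\lambda)$, and the oldest gap right after an update contributes $(k+1)\lambda^{1-k}$. The function names are hypothetical. For $k = 2$ and $k = 3$ the result agrees with the first two rows of Table 1.

```python
def round_robin_ratio(k, tol=1e-13):
    """Bisect for the ratio lam in (1, 2] at which the two constraints
    balance, i.e. 1 - 1/lam == lam**(1 - k)."""
    lo, hi = 1.0, 2.0
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if 1 - 1 / mid < mid ** (1 - k):
            lo = mid   # newest gap still has slack: the ratio can grow
        else:
            hi = mid
    return (lo + hi) / 2


def round_robin_efficiency(k, lam):
    """Worst compliance ratio of geometric Round-Robin with ratio lam."""
    newest_gap = 1 - 1 / lam       # last subinterval just before an update
    oldest_gap = lam ** (1 - k)    # first subinterval right after an update
    return (k + 1) * max(newest_gap, oldest_gap)


for k in (2, 3):
    lam = round_robin_ratio(k)
    print(k, round(lam, 6), round(round_robin_efficiency(k, lam), 7))
```

The first constraint grows with the ratio and the second shrinks, so the worst case is smallest where they meet, which is why a bisection on their difference suffices.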
Remark.
Round-Robin is pretty bad for large $k$; indeed, its efficiency is asymptotically inferior to the simple bound of $2$ from the introduction.
The case $k = 2$ is made obvious by Property 2, since without loss of generality Round-Robin is the only 2-checkpoint algorithm to consider. Thus the optimal efficiency for $k = 2$ is $1.5$.
Proposition 3.2.
For we have , where is the smaller root of .
Proof.
For the upper bound, Round-Robin is -efficient. Note that is the golden ratio.
For the lower bound, consider a -efficient -checkpoint algorithm and a snapshot . By subgeometry we must have and for subinterval 1 to be compliant we need , which together imply , i.e., . Thus . ∎
Round-Robin is no longer optimal for . Indeed, cyclic algorithms with better efficiency were described in (Bringmann et al., 2013, Figure 3) for . These provide upper bounds on , respectively. Nevertheless, no formal proof of optimality was provided.
Remark.
These algorithms were found by the use of linear programming, which is thoroughly discussed in Section 3.2.
For $k = 4, 5$ the optimal algorithms are cyclic with period $2$; the one for $k = 5$ is geometric while the one for $k = 4$ is not.
Proposition 3.3.
For we have , where is the (only) real root of .
Proof.
For the upper bound, set and consider the -geometric algorithm with pattern and initial snapshot . The algorithm is indeed -cyclic, as and . Due to being geometric, we just need to verify compliance of subinterval 3 in : and subinterval 1 in : . Altogether it remains to show that
[TABLE]
both of which are equal to .
For the lower bound, consider a -efficient -checkpoint algorithm. It cannot be Round-Robin, which satisfies and in particular . We thus pick a snapshot followed by update steps and . Thus is one of , , or . In either case, compliance of , and implies , which together with yields , i.e., . Thus . ∎
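The upper-bound construction for five checkpoints can be cross-checked against Table 1 by direct simulation. This is a sketch under stated assumptions: update times are taken to be powers of a ratio `lam`, the pattern $(1, 3)$ and the value of $\gamma$ are read from the $k = 5$ row of Table 1, and a geometric algorithm with period 2 is assumed to satisfy $\gamma = \lambda^2$. The helper names are hypothetical, and a warm-up phase stands in for the rebasing of Remark 1.

```python
import math


def compliance_ratio(snapshot, t):
    """Longest subinterval of [0, t] relative to the ideal length t / (k + 1)."""
    k = len(snapshot)
    points = [0.0] + list(snapshot) + [t]
    longest = max(right - left for left, right in zip(points, points[1:]))
    return longest * (k + 1) / t


def geometric_efficiency(k, pattern, lam, warmup=40, measured=40):
    """Simulate update times lam**i with a repeating pattern of freshness
    indices; return the worst ratio observed after the warm-up phase."""
    snapshot = [lam ** i for i in range(k)]
    worst = 0.0
    for step in range(warmup + measured):
        t = lam ** (k + step)
        b = pattern[step % len(pattern)]
        before = compliance_ratio(snapshot, t)             # just before update
        snapshot = snapshot[:b - 1] + snapshot[b:] + [t]   # overwrite index b
        after = compliance_ratio(snapshot, t)              # just after update
        if step >= warmup:
            worst = max(worst, before, after)
    return worst


gamma = 1.7548777          # Table 1, row k = 5
lam = math.sqrt(gamma)     # assumed: period 2, so gamma = lam ** 2
print(geometric_efficiency(5, (1, 3), lam))
```

The printed value should agree with the tabulated efficiency for $k = 5$ up to the rounding of $\gamma$.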
Proposition 3.4.
For we have , where is the smallest root of . Moreover, the efficiency of any geometric 4-checkpoint algorithm is at least , where is the real root of .
Proof.
For the upper bound, let be the largest root of and consider the -cyclic algorithm with pattern starting at . The algorithm is indeed -cyclic, as and . The update times sequence is -subgeometric if and ; furthermore we need to verify compliance of subinterval 3 in : and subinterval 1 in : . Altogether it remains to show
[TABLE]
Indeed the last three are equal to while the first is
[TABLE]
using since .
For the lower bound, consider a -efficient 4-checkpoint algorithm. If it is Round-Robin, it satisfies so . Otherwise pick a snapshot followed by update steps and . The snapshot is or . The snapshot after the next update step can be one of four:
- If then by compliance of and , yielding , i.e., ;
- If is or then by compliance of and , or by compliance of ; either way yields , which again implies ;
- The only remaining option is , which means , i.e., . Now
[TABLE]
by compliance of all four snapshots, so
[TABLE]
hence , i.e., , which implies .
Note that a -geometric update times sequence would satisfy in the last case, yielding one last time and implying . ∎
3.2. Casting the problem as a linear program
Fix and an update pattern . Can we choose a sequence of update times such that the resulting -checkpoint algorithm is -efficient?
Each snapshot consists of a particular subset of the variables , and using we can determine exactly which. Furthermore, all constraints (e.g., monotonicity, subgeometry, compliance) can be expressed as linear inequalities. This gives rise to an infinite linear program , which is feasible whenever a -efficient algorithm with the prescribed pattern exists. Note that all constraints are homogeneous, so to avoid the zero solution we add the non-homogeneous condition .
In addition, we are not interested in solutions where is bounded. This can happen, for instance, when for all (footnote 6: this may not seem a valid pattern to consider, given Property 1; however, when solving a finite subprogram we might have to consider an arbitrarily long prefix of the pattern with no occurrences of 1). Luckily, by using Property 3 we can restrict our attention to exponentially increasing sequences , so we add to the linear inequalities from Properties 3 and 4. Now is feasible if and only if a -efficient algorithm with the prescribed pattern exists; in other words, is the infimum (footnote 7: this infimum is actually a minimum, by (Bringmann et al., 2013, Theorem 8) and also by Proposition 3.5) over for which there exists a pattern such that is feasible.
As an infinite program, is not too convenient to work with. We can thus limit our attention to finite subprograms for some , which only involve the variables and the relevant constraints. Finite subprograms can no longer ensure the existence of a -efficient algorithm, but can be used to prove lower bounds on in the following way. Write and consider the set of strings, i.e, finite sequences over .
Definition.
A string is called a -witness if is infeasible. A string set is called blocking if any infinite sequence over contains some as a substring.
Property 5.
If there exists a blocking set of -witnesses for some , then .
Remark.
The lower bound of Property 5 holds for all algorithms, cyclic or not.
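Whether a finite set of strings is blocking can be decided mechanically. The sketch below is one possible check, not the authors' implementation: it treats windows of recent symbols as states of a finite graph and prunes states that cannot be extended, so an infinite avoiding sequence exists exactly when some state survives. The alphabet is passed as a parameter because the extracted text does not preserve the one used in the paper; all function names are hypothetical.

```python
from itertools import product


def contains_any(seq, forbidden):
    """True if some forbidden tuple occurs as a contiguous block of seq."""
    return any(seq[i:i + len(w)] == w
               for w in forbidden
               for i in range(len(seq) - len(w) + 1))


def is_blocking(forbidden, alphabet):
    """True if every infinite sequence over `alphabet` contains a member of
    `forbidden` (non-empty tuples) as a contiguous substring."""
    forbidden = [tuple(w) for w in forbidden]
    if not forbidden:
        return False
    m = max(len(w) for w in forbidden)
    # States: windows of length m - 1 that avoid every forbidden string.
    alive = {s for s in product(alphabet, repeat=m - 1)
             if not contains_any(s, forbidden)}

    def successors(s):
        for a in alphabet:
            window = s + (a,)
            if not contains_any(window, forbidden):
                yield window[1:]

    # Repeatedly discard states with no surviving successor; what remains
    # are exactly the states lying on an infinite avoiding sequence.
    changed = True
    while changed:
        changed = False
        for s in list(alive):
            if not any(nxt in alive for nxt in successors(s)):
                alive.discard(s)
                changed = True
    return not alive


print(is_blocking([(1, 1), (2,)], (1, 2)))    # every sequence hits "2" or "11"
print(is_blocking([(1, 1), (2, 2)], (1, 2)))  # 1,2,1,2,... avoids both
```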
We now describe a strategy to approximate to arbitrary precision. For the lower bound we use Property 5; for the upper bound, we limit our focus to cyclic algorithms. Given and a string of length , we can augment with equality constraints ; call the resulting program . This is a finite linear program, which we can computationally solve given , , and . Although is not known to us, we can first compute an approximation of by solving , and then solve . Using binary search, we can compute a numerical approximation of the minimal for which is feasible. Lastly, we can enumerate short strings in a BFS/DFS-esque manner and take the best obtained.
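The binary-search step of this strategy can be sketched independently of any LP solver. In the paper the feasibility oracle is a linear program solved with GLPK via CVXOPT; the sketch below replaces it with a closed-form stand-in for geometric Round-Robin, derived here from the definitions, purely to keep the example dependency-free. The names `minimal_feasible` and `round_robin_feasible` are hypothetical, and the search assumes feasibility is monotone in the efficiency parameter.

```python
def minimal_feasible(is_feasible, lo, hi, tol=1e-9):
    """Approximate the smallest q in [lo, hi] with is_feasible(q), assuming
    monotone feasibility: infeasible at lo, feasible at hi."""
    if is_feasible(lo) or not is_feasible(hi):
        raise ValueError("bracket must go from infeasible to feasible")
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if is_feasible(mid):
            hi = mid
        else:
            lo = mid
    return hi


def round_robin_feasible(q, k=3):
    """Stand-in oracle: can geometric Round-Robin on k checkpoints keep
    every compliance ratio at or below q?"""
    smallest_ratio = ((k + 1) / q) ** (1 / (k - 1))  # oldest gap must fit
    largest_ratio = (k + 1) / (k + 1 - q)            # newest gap must fit
    return smallest_ratio <= largest_ratio


print(round(minimal_feasible(round_robin_feasible, 1.0, 2.0), 6))
```

With a real LP oracle in place of the stand-in, the same loop would be run once per candidate pattern, keeping the best value found.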
To demonstrate this strategy, we computed up to 7 decimal digits, using a Python program employing GLPK (Free Software Foundation, 2012) via CVXOPT (Andersen et al., 2016) (see Table 1; starred values of are geometric algorithms).
At first it seems that Property 5 cannot be used to pinpoint exactly, since any finite blocking set of -witnesses for some leaves an interval of uncertainty of length . The following proposition eliminates this uncertainty.
Proposition 3.5.
For every string there is a finite set such that the feasibility of for some only depends on the relative order between and members of . In particular, there exists some such that if is feasible and is a -witness, then is also a -witness for all .
Proof.
Fix . Treating as a parameter, note that the subprogram is feasible if and only if its feasible region, the convex polytope , is nonempty. Decreasing shrinks until some critical for which is reduced to a single vertex, at which a subset of the linear constraints are satisfied with equality. Hence is a solution of some polynomial equation determined by the relevant constraints. The set of constraints is finite, thus there are finitely many polynomial equations that can define , and we can take as the set of all their roots. Now take to be smaller than the distance between any two distinct elements of . ∎
Remark 0.
Note that when is small enough, we can actually retrieve the polynomial equations defining and from the polytope $\mathcal{P}^{*}\big(\tilde{\lambda},\tilde{\gamma};B\big)$; using this method we get an algebraic representation of rather than a rational approximation. To demonstrate, , where is the smallest real root of
[TABLE]
4. Asymptotically Optimal Upper Bounds
In this section we describe a family of geometric -checkpoint algorithms. Despite our experience from Table 1 (that only for optimal algorithms are geometric), this family is rich enough to be asymptotically optimal, i.e., -efficient.
4.1. A recursive geometric algorithm
Fix a real number and an integer . We describe a -checkpoint algorithm Recursive, where is an -subset whose elements are
[TABLE]
Recursive is -geometric, and its update pattern is defined as , where is the largest for which divides . It is easy to see that is -periodic, and we can just refer to . As per Remark 1, via rebasing there is no need to explicitly define the initial snapshot .
Example 0.
For we get .
True to its name, Recursive can also be viewed as a recursive algorithm: the base case (i.e., ) is simply -checkpoint Round-Robin; for , Recursive alternates between updating the -st oldest checkpoint and acting according to the inner -checkpoint algorithm Recursive.
Let us elaborate a bit more on the recursive step. In every snapshot we have for since we never update checkpoints younger than . In every odd snapshot we have just updated the -st oldest checkpoint, so while . This means that for all have the same parity as in any snapshot . We thus treat as a snapshot of a -checkpoint algorithm, which operates at half speed and never sees half of the checkpoints. The inner algorithm can rightfully be called Recursive, as the common ratio of the update times sequence for the checkpoints that do make it to the inner algorithm is , and taking only the even locations of yields a periodic sequence such that .
4.2. Analyzing the recursive algorithm
First we determine exactly how efficient Recursive can be for any and , and then we work with a particular choice.
Denote by the maximum of
[TABLE]
where
[TABLE]
Theorem 4.1.
Given and , the efficiency of Recursive is .
Proof.
Write and denote by the efficiency of Recursive. Our aim is to show that ; namely, that Recursive is -efficient, but is not -efficient for any .
We proceed by induction on . The base case is Round-Robin, whose efficiency by the proof of Proposition 3.1 is . Assume now and consider a critical subinterval of length in a snapshot , taken at time . This subinterval can be of one of three types:
(1) The last one, between and . Here, by (1a),
[TABLE]
(2) One created by merging two smaller subintervals in an odd snapshot (when is odd (even) we call the snapshot odd (even)), between and . Here, by the case of (1b),
[TABLE]
(3) One created by merging two smaller subintervals in an even snapshot. This subinterval is among , which we treat as a snapshot of Recursive, taken at time . Denote the efficiency of Recursive by ; thus
[TABLE]
with equality whenever this subinterval is critical for Recursive. By the induction hypothesis, is the maximum of
[TABLE]
where
[TABLE]
To show , it thus remains to show that .
(i) If then by case of (1b);
(ii) If
[TABLE]
then by the remaining cases of (1b);
(iii) If then by (1c).
Now assume Recursive is -efficient for , where , i.e., is strictly smaller than (1a), (1b) and (1c). The critical subinterval of length cannot be of type 1 or 2, due to (1a) and case of (1b), respectively, so it must be of type 3. But then we get strict inequalities for the inner algorithm, so actually , contradicting the induction hypothesis. ∎
Given an integer , let . Define by for and . Note that and that .
Theorem 4.2.
Recursive is -efficient for large enough , where and
[TABLE]
Proof.
By Theorem 4.1, it suffices to verify that , that is, for sufficiently large . Clearly (1a) holds since for all . It remains to verify (1b) and (1c), handled by Propositions 4.3 and 4.4 respectively. ∎
Proposition 4.3.
For and as above, for all .
Proposition 4.4.
For and as above, .
Remark 0.
Theorem 4.2 chooses suboptimally. Empirical evidence shows that, for all , the optimal for Recursive satisfies (1a) and one of (1b) and (1c). In other words, it is the smallest root of either or for some (see also Tables 3 and 4).
Remark 0.
With additional effort the constant in Theorem 4.2 can be improved by a factor of almost 6 to . The major obstacle is that cases and of (1b) need to be done separately since the appropriate in the proof of Proposition 4.3 is negative for . No proof is possible for since then (1c) would be violated for large enough .
Remark 0.
We verified that the algorithm Recursive is -efficient for as well. See Tables 3 and 4 for (best case) and (worst case), respectively.
Before proving Propositions 4.3 and 4.4 we would like to simplify for our .
Claim 0.
 for ; moreover, .
Proof.
We have
[TABLE]
The slightly improved bound for is obtained by observing that in the above calculation since . ∎
Write and observe that and for our choice of .
Proof of Proposition 4.4.
By the claim above
[TABLE]
so
[TABLE]
where the first inequality is true since for and indeed for we have and . ∎
Proof of Proposition 4.3.
By the claim above
[TABLE]
using for all , and taking . Thus, it remains to show that
[TABLE]
Let and note that as . Define
[TABLE]
to conclude the proof we show that is positive for all and .
First we compute some partial derivatives of .
[TABLE]
Now everywhere, so is increasing and for all . Next for , so is decreasing and
[TABLE]
for all . Lastly, is concave in as everywhere. Thus
[TABLE]
for all . ∎
5. Asymptotically Optimal Lower Bounds
In this section we prove lower bounds on , focusing on asymptotic lower bounds in which grows to infinity.
We start by reproving the simple asymptotic lower bound of (Bringmann et al., 2013, Theorem 6), and then improve it to , which is asymptotically optimal via the matching upper bound of Section 4.
5.1. Stability and bounding expressions
Obtaining lower bounds requires viewing the problem from a different perspective. It will sometimes be more convenient to refer to a certain physical checkpoint, without considering its temporary freshness index in the checkpoint sequence at some snapshot (which is variable and depends on ).
Given a -checkpoint algorithm, we define a function and use it to bound its efficiency from below. The parameter is related to the notion of stability, which we now define.
Definition 0.
Fix a -checkpoint algorithm. A checkpoint updated at time is called -stable, for some , if at least previous checkpoints are updated before the next time it is updated.
By Property 1 we can assume all checkpoints get updated eventually; this means that in a snapshot , where is a time by which all checkpoints have been updated from the initial snapshot, we have that the checkpoint updated at time is -stable for .
For convenience, the proofs in this section assume the update times sequence is normalized by a constant. This is captured by the following definition.
Definition 0.
A -checkpoint algorithm is called -normalized if an -stable checkpoint is updated at time .
Given an -normalized -checkpoint algorithm, we define a sequence of times as follows: for is the time at which the -th checkpoint is removed from . In other words, is the time at which some checkpoint is updated; is the time at which we update the next checkpoint that was previously updated in (but not at ), and so forth. Note that the checkpoint updated at time is not updated at any time for by the definition of stability. Now we are ready to define .
Definition 0.
The -truncated bounding expression of an -normalized -checkpoint algorithm is , where .
The bounding expression plays a crucial role in proving lower bounds, based on Proposition 5.1 below. We note that the truncated bounding expression only depends on the algorithm's behavior until time , and hence the bounds that can be obtained from it are not tight for . Nevertheless, the lower bound we obtain using in Corollary 5.6 is asymptotically optimal, since the gap between it and the upper bound of Theorem 4.2 tends to zero as grows to infinity.
Remark 0.
It is possible to analyze beyond and obtain tight lower bounds for larger values of . However, there is no asymptotic improvement and the analysis becomes increasingly more technical as grows.
5.2. Asymptotic lower bound of
To simplify the analysis, we assume is even. It can be extended to cover odd values of as well, but this gives no asymptotic improvement since for all , so we only lose an error term of , which is of the same order as the error terms in Corollaries 5.4 and 5.6.
To simplify our notation we write throughout this section.
Proposition 5.1.
Any -normalized -efficient -checkpoint algorithm satisfies .
Proof.
At time , the time interval contains subintervals of length , giving rise to the inequality . At time , a checkpoint is removed from and it now contains one subinterval of length (two previous subintervals, each of length , were merged), and subintervals of length , giving .
At time , an additional checkpoint is removed from the time interval , hence it must contain a subinterval of length formed by merging two previous subintervals. We obtain , since the remaining subintervals must include subintervals of length and one (additional) subinterval of length at most . Note that this claim holds regardless of which checkpoint is updated at , and it holds in particular in case one of the subintervals merged at time contains the subintervals merged at (in fact, this case gives the stronger inequality ).
In general, for , at time the time interval must contain distinct subintervals of lengths for , and subintervals of length . This gives the inequality
Let be the largest index such that , so for all and for all . Now at time we have
[TABLE]
∎
Now we need an upper bound on the bounding expression. For the simpler lower bound of we use the following proposition.
Proposition 5.2.
Any -normalized -efficient -checkpoint algorithm satisfies for .
Proof.
At time , all subintervals are of length at most . Since , for any the time interval must consist of at least subintervals, implying that the -th checkpoint was removed from the time interval by time . ∎
Proposition 5.3.
Let . Any -normalized -efficient -checkpoint algorithm satisfies
Corollary 5.4.
For all even we have .
Proof.
Fix a -normalized -efficient -checkpoint algorithm, and let . By Propositions 5.1 and 5.3 we have
[TABLE]
∎
Proof of Proposition 5.3.
For , we have , but we cannot assure that for . By Proposition 5.2,
[TABLE]
Now for the sum on the right-hand side can be bounded by
[TABLE]
establishing the proposition. ∎
5.3. Improved asymptotic lower bound of
We now improve the asymptotic lower bound to . This result is a simple corollary of the following lemma, which gives a tighter upper bound on the bounding expression. Recall that and thus .
Lemma 5.5.
For any -normalized -efficient -checkpoint algorithm such that and we have .
Proof.
The proof is by induction on . For the base case , one checkpoint is updated at time at most , giving .
Assume the hypothesis holds for all and our goal is to prove it for . Without loss of generality we assume the algorithm satisfies all properties of Section 2. Consider a snapshot at time , and denote by the update time of the first checkpoint in the time interval at the snapshot time . We would like to apply the induction hypothesis from time , but this cannot be done directly since it is not guaranteed that the checkpoint last updated at in the snapshot at time is -stable (potentially, fewer than checkpoints are removed from at ). To overcome this problem, recall that the truncated bounding expression only considers the algorithm up to time by setting . Consequently, we can analyze a slightly different algorithm with the same bounding expression in which the checkpoint at in time is -stable. (There are other ways to solve the problem and apply the induction hypothesis, e.g., by extending the definition of a stable checkpoint. However, this seems to require slightly more complex definitions and induction hypothesis.) The modification is simple: if the original algorithm removes checkpoints from in , no change is required; otherwise, and the modified algorithm would simply remove additional arbitrary checkpoints from at time . This transformation leaves unchanged, and we can analyze it instead. Note that the modified algorithm maintains all properties of Section 2 at times .
We first consider the case in which there is no checkpoint update in the time interval , implying that . We can now apply the induction hypothesis from with since at least checkpoints are removed from before the checkpoint at is updated again, namely, the checkpoint at is -stable (we have an -normalized checkpoint algorithm). Therefore
[TABLE]
Note that the multiplication of with undoes the normalization of the bounding expression at time , and the addition with is because should account for , but should not.
We also note that this actually proves a (slightly) stronger result, since when calculating from , we do not add terms larger than , but when calculating from , the restriction is looser, i.e., not adding terms larger than . Therefore, if actually contains terms in the time interval , then is strictly smaller than .
We are left to prove the hypothesis for given that there is at least one checkpoint update in the time interval . Since there is no checkpoint in the time interval in the snapshot at , we have . Therefore, and by Property 4 there is exactly one update in . Therefore, the update in occurred at time , and we denote by the label of the actual checkpoint involved. Furthermore, we have . (The only use of truncating the bounding expression at in the proof is to limit the number of updates in to one.)
Denote by the time (after ) of the next update of (). After time , all checkpoints were removed from , hence . As in the previous case, we apply the induction hypothesis from with since we are assured that at least checkpoints are removed from before is updated (including ), hence the checkpoint updated at is -stable. We get . Note that we add to the right hand side (to bound ) since is first updated at after , and not at , which is the time it is first updated after (as considered in ). Once again, we prove a slightly stronger result than required, as calculated from may contain terms which are larger than 2.
Recalling that , we obtain
[TABLE]
so to show that , it is sufficient to show that .
Obviously ; there are 3 checkpoint updates in the time interval , so by Property 3, , and thus ; lastly, so by the lemma's assumption, which gives , as . This completes the induction and the proof of the lemma. ∎
Corollary 5.6.
For all even we have . In particular, .
Proof.
Write and assume for the sake of contradiction that . By Lemma 5.5 and Proposition 5.1 we have for a -normalized -efficient -checkpoint algorithm, so , contradicting our assumption. Now
[TABLE]
hence . The last inequality is true when , but we already know that . ∎
6. Additional Applications of Checkpointing Algorithms
Most of the applications of online checkpointing algorithms described so far in the literature are related to fault tolerance: If we discover an error in a lengthy computation, we may want to correct it without restarting the computation from the beginning. In this section we briefly describe two novel applications for checkpointing algorithms which are motivated by problems in cryptography and cyber security.
Let be a cryptographic hash function which maps -bit inputs to random-looking -bit outputs. A classical cryptanalytic problem is to find a collision in a given hash function, i.e., two different inputs which are mapped by to the same output . By the birthday paradox, we expect to find such a collision if we evaluate and compare the values of for random inputs . The naive method is to store all these values in an appropriate data structure, but since memory is much more expensive than time, we would like to find such a repetition using only a small number of memory cells. For the best known solution is to use Floyd's two finger algorithm (Floyd, 1997) which iterates the application of starting from some random initial point . Since the space of values is finite, the evolving chain must eventually repeat itself, and since is deterministic the chain will fold into a cycle, and repeat itself forever. Floyd's algorithm maintains two pointers along the generated chain of values by moving the endpoint pointer at speed and the midpoint pointer at speed . It stops when the two pointed values are the same. However, this algorithm is non-optimal for two reasons: It finds a collision only after wasting on average additional evaluation steps without noticing that its endpoint is already repeating itself, and it performs on average evaluations of to extend the evolving chain by one step. To reduce the number of wasted steps, we can maintain a larger number of pointers along the evolving chain, and thus catch the repetition of values at an earlier stage. To make each step more efficient, we can evaluate only the pointer at the end of the chain, and use our optimal pebbling strategies to leapfrog the memorized pointers to new locations along the chain at zero evaluation cost.
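For concreteness, Floyd's two finger algorithm can be sketched as follows. This is a minimal Python illustration of the textbook procedure, not the authors' implementation; the name `floyd_collision` and the toy map in the usage note are assumptions made for the example. The slow pointer advances one step and the fast pointer two steps per iteration, so each iteration of the first loop costs three evaluations of the function.

```python
from typing import Callable, Optional, Tuple


def floyd_collision(f: Callable[[int], int],
                    x0: int) -> Optional[Tuple[int, int]]:
    """Return distinct a, b with f(a) == f(b) found on the chain from x0.

    Returns None when x0 itself lies on the cycle, in which case the chain
    has no tail and this starting point yields no collision.
    """
    # Phase 1: slow pointer moves one step, fast pointer two steps,
    # until both point at the same value somewhere on the cycle.
    slow, fast = f(x0), f(f(x0))
    while slow != fast:
        slow = f(slow)
        fast = f(f(fast))

    # Phase 2: restart the slow pointer at x0 and move both pointers one
    # step at a time; they first agree at the entry point of the cycle.
    slow = x0
    if slow == fast:
        return None
    while True:
        next_slow, next_fast = f(slow), f(fast)
        if next_slow == next_fast:
            # slow is on the tail, fast is on the cycle, same image.
            return slow, fast
        slow, fast = next_slow, next_fast
```

With the toy map x -> (x*x + 1) mod 11 and starting point 0, the chain is 0, 1, 2, 5, 4, 6, 4, 6, ... and the sketch returns the pair (5, 6), both of which map to 4.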
We experimentally tested this strategy with pointers, and observed about reduction in the worst case waste of our collision finding algorithm compared to the standard interval doubling algorithm described by Brent (Brent, 1980).
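The interval doubling baseline can be sketched in the same style. The following is a standard textbook formulation of Brent's cycle detection, given under the assumption that this is what "interval doubling" refers to; it is not the code used in the experiments, and `brent_cycle` is a hypothetical name. It keeps a single memorized pointer, moves it to the current end of the chain whenever the number of steps since the last move reaches the next power of two, and evaluates the function once per step in its first loop.

```python
from typing import Callable, Tuple


def brent_cycle(f: Callable[[int], int], x0: int) -> Tuple[int, int]:
    """Return (cycle_length, tail_length) of the chain x0, f(x0), ..."""
    # Phase 1: find the cycle length with one saved pointer that is
    # teleported to the end of the chain at steps 1, 2, 4, 8, ...
    power = length = 1
    saved = x0
    current = f(x0)
    while saved != current:
        if power == length:
            saved = current
            power *= 2
            length = 0
        current = f(current)
        length += 1

    # Phase 2: place two pointers cycle_length apart and walk them in
    # lockstep; they meet exactly at the entry point of the cycle.
    lead = trail = x0
    for _ in range(length):
        lead = f(lead)
    tail = 0
    while trail != lead:
        trail = f(trail)
        lead = f(lead)
        tail += 1
    return length, tail
```

On the toy map x -> (x*x + 1) mod 11 started at 0, this reports a cycle of length 2 entered after a tail of length 4.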
A second application is related to backup strategies against sophisticated cyber attacks. Such attacks try to enhance their destructiveness by stealthily corrupting all the available backups before launching the actual attack. To model such attacks, we assume that the defender's backup strategy consists of deciding when to refresh the data in each one of his backup devices. When a backup device is connected to the main computer, one of two things can happen: if the computer is still clean, the device will instantaneously update all the files with their current contents; if the computer is already infected, all the data on the device will be lost. The problem is that the defender does not know whether his computer had already been compromised, and has to prepare for the worst possible choice of infection time. Note that the standard update strategy of keeping external disks stored unpowered in a safe and connecting one of them at the end of each day in a round robin way will lead to the loss of all the backups days after the initial infection. It can be shown that the best backup strategy in this model is to keep all the backups as evenly spread out as possible along the timeline, so that some of the backup disks will not be connected for a long time, while others will have relatively fresh versions of the file system. By using our proposed pebbling strategies, the defender can make his worst case loss as small as possible.
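The round robin failure mode can be checked with a small simulation. This is a sketch under the model above, with hypothetical names (`days_until_all_lost`, `round_robin`, `schedule`, `horizon`): from the infection day onward, every device that gets connected is lost, and the function counts the days until no device survives.

```python
from typing import Callable, Optional


def days_until_all_lost(k: int, infection_day: int,
                        schedule: Callable[[int], int],
                        horizon: int = 10_000) -> Optional[int]:
    """Count days from infection until all k backup devices are lost.

    schedule(day) gives the index (0..k-1) of the device connected on that
    day. Returns None if some device survives for `horizon` days.
    """
    lost = set()
    for elapsed in range(1, horizon + 1):
        day = infection_day + elapsed - 1
        lost.add(schedule(day))  # connecting to an infected host wipes it
        if len(lost) == k:
            return elapsed
    return None


def round_robin(k: int) -> Callable[[int], int]:
    """Connect device (day mod k) on each day."""
    return lambda day: day % k
```

For round robin over `k` devices the count is exactly `k` regardless of the infection day, since any `k` consecutive days touch every device once; a schedule that leaves some devices disconnected for longer keeps an uncorrupted (if older) backup for correspondingly longer.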
7. Concluding Remarks and Open Problems
In this paper we solved the main open problem in online checkpointing algorithms, which is to find tight asymptotic upper and lower bounds on their achievable efficiency. In addition, we developed efficient techniques for determining tight upper and lower bounds on for small values of , which enabled us to develop provably optimal concrete algorithms for all . However, determining the values of for larger values of remains a computationally challenging problem, and finding more efficient ways to compute these values remains an interesting open problem.
Appendix A Tables
Table 2 shows the best algorithms our LP approach found for . These are (perhaps non-tight) upper bounds on . Observe how some of the patterns are reminiscent of the pattern used in the algorithm Recursive of Section 4.
Tables 3 and 4 describe the efficiency of Recursive in two extreme cases: the "best" case , which is the special case handled by Binary of (Bringmann et al., 2013), and the "worst" case , which shows that the upper bound we proved on the efficiency of Recursive is essentially tight.
- In the first case, the optimal value for is the smallest real root of
[TABLE]
i.e., (1a) and (1b) are tight;
- In the second case, the optimal value for all is the smallest real root of
[TABLE]
i.e., (1a) and (1c) are tight.
The second from the right column shows how close is to its lower bound , demonstrating the sharpness of Corollary 5.6. The rightmost column in each table shows the "effective constant" (defined to be for efficiency ) as a percentage of the constant . The fact that it asymptotically approaches 100% in Table 4 shows that indeed is optimal and the analysis is tight.
Acknowledgements.
The work was partially supported by the European Research Council under the ERC starting grant agreement n. 757731 (LightCrypt), the BIU Center for Research in Applied Cryptography and Cyber Security in conjunction with the Israel National Cyber Bureau in the Prime Minister's Office, and by the Israeli Science Foundation through grant No. 573/16.
- Ahlroth et al. (2013). Lauri Ahlroth, Olli Pottonen, and André Schumacher. 2013. Approximately Uniform Online Checkpointing with Bounded Memory. Algorithmica 67, 2 (2013), 234–246.
- Andersen et al. (2016). M. S. Andersen, J. Dahl, and L. Vandenberghe. 2016. CVXOPT: A Python package for convex optimization, version 1.1.9. Available at http://cvxopt.org.
- Bern et al. (1994). Marshall W. Bern, Daniel H. Greene, Arvind Raghunathan, and Madhu Sudan. 1994. On-Line Algorithms for Locating Checkpoints. Algorithmica 11, 1 (1994), 33–52.
- Brent (1980). Richard P. Brent. 1980. An improved Monte-Carlo factorization algorithm. BIT Numerical Mathematics 20, 2 (1980), 176–184.
- Bringmann et al. (2013). Karl Bringmann, Benjamin Doerr, Adrian Neumann, and Jakub Sliacan. 2013. Online Checkpointing with Improved Worst-Case Guarantees. In Proceedings of the 40th International Colloquium on Automata, Languages, and Programming (ICALP). Springer, 255–266.
- Chandy and Ramamoorthy (1972). K. Mani Chandy and Chittoor V. Ramamoorthy. 1972. Rollback and Recovery Strategies for Computer Programs. IEEE Trans. Computers 21, 6 (1972), 546–556.
- Floyd (1997). Robert W. Floyd. 1997. Cycle Finding Algorithm. 7 pages. Floyd's algorithm appears as exercise 3.1-6 in Donald Knuth, The Art of Computer Programming, Vol. 2: Seminumerical algorithms.
