Recomputation Enabled Efficient Checkpointing
Ismail Akturk, Ulya R. Karpuzcu

TL;DR
This paper introduces AmnesiCHK, a recomputation-based checkpointing framework that reduces storage, time, and energy overheads of system checkpointing by selectively recomputing data values instead of storing them.
Contribution
The paper proposes an amnesic checkpointing approach that recomputes selected data values on recovery rather than storing them in the checkpoint, which lowers the storage, time, and energy overheads of checkpointing.
Findings
Storage overhead reduced by up to 23.91%
Time overhead reduced by 11.92%
Energy overhead reduced by 12.53%
Abstract
Systematic checkpointing of the machine state makes restart of execution from a safe state possible upon detection of an error. The time and energy overhead of checkpointing, however, grows with the frequency of checkpointing. Amortizing this overhead becomes especially challenging, considering the growth of expected error rates, as checkpointing frequency tends to increase with increasing error rates. Based on the observation that due to imbalanced technology scaling, recomputing a data value can be more energy efficient than retrieving (i.e., loading) a stored copy, this paper explores how recomputation of data values (which otherwise would be read from a checkpoint from memory or secondary storage) can reduce the machine state to be checkpointed, and thereby reduce the checkpointing overhead. Specifically, the resulting amnesic checkpointing framework AmnesiCHK can reduce the storage…
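The core idea in the abstract, omitting from the checkpoint any value that can be recomputed from values that are stored, can be illustrated with a toy sketch. This is not the AmnesiCHK implementation: the dictionary-based state, the `recipes` table, and the function names `take_checkpoint` and `restore` are all hypothetical and chosen only for illustration. The sketch also assumes that every recipe reads only values that are kept in the checkpoint.

```python
from typing import Callable, Dict, Tuple

# Hypothetical recipe table: value name -> (input names, function that recomputes it).
# Assumption: each recipe depends only on values that are stored in the checkpoint.
Recipe = Tuple[Tuple[str, ...], Callable[..., int]]


def take_checkpoint(state: Dict[str, int], recipes: Dict[str, Recipe]) -> Dict[str, int]:
    """Store only values without a recipe; recomputable values are left out."""
    return {name: value for name, value in state.items() if name not in recipes}


def restore(checkpoint: Dict[str, int], recipes: Dict[str, Recipe]) -> Dict[str, int]:
    """Rebuild the full state by recomputing every omitted value from stored inputs."""
    state = dict(checkpoint)
    for name, (inputs, fn) in recipes.items():
        state[name] = fn(*(state[i] for i in inputs))
    return state


# Toy state: "sum" and "prod" are derivable from "a" and "b".
state = {"a": 3, "b": 4, "sum": 7, "prod": 12}
recipes = {
    "sum": (("a", "b"), lambda a, b: a + b),
    "prod": (("a", "b"), lambda a, b: a * b),
}

ckpt = take_checkpoint(state, recipes)   # holds only "a" and "b"
restored = restore(ckpt, recipes)        # "sum" and "prod" are recomputed
```

In this toy, the checkpoint shrinks from four entries to two, at the cost of two extra operations during recovery. Whether that trade pays off in a real system depends on the relative cost of recomputation versus loading, which is the question the paper evaluates.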
Taxonomy
Topics: Distributed systems and fault tolerance · Parallel Computing and Optimization Techniques · Radiation Effects in Electronics
