JASS: A Flexible Checkpointing System for NVM-based Systems
Akshin Singh, Smruti R. Sarangi

TL;DR
This paper introduces JASS, a checkpointing system for NVM-based systems that reduces write amplification significantly by allowing some work loss and using adaptive checkpointing, thereby extending NVM lifespan.
Contribution
The paper presents a novel checkpointing approach that minimizes write amplification in NVM systems by relaxing instant recovery requirements and employing an adaptive control algorithm.
Findings
Reduces write amplification by 2.3-96% compared to existing methods.
Extends NVM lifespan proportionally to the reduction in write amplification.
Achieves near-optimal performance with the proposed adaptive system.
Abstract
NVM-based systems are natural candidates for periodic checkpointing (or snapshotting). Checkpointing increases the reliability of the system, makes it more resilient to power failures, and reduces wasted work, especially in an HPC setup. The traditional line of thinking is to design a system that is conceptually similar to transactional memory, where we log updates all the time and minimize the wasted work or, alternatively, the MTTR (mean time to recovery). Such "instant recovery" systems allow the system to recover from a point that is quite close to the point of failure. The penalty that we pay is a prohibitive number of additional writes to the NVM. We propose a paradigmatically different approach in this paper, where we argue that in most practical settings, such as regular HPC workloads or neural network training, there is no need for such instant recovery. This means…
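The trade-off the abstract describes, logging every update versus checkpointing periodically, can be illustrated with a toy write-count model. This is only a sketch under stated assumptions, not the JASS algorithm: it assumes that per-update logging costs one log write plus one data write per store, that a periodic checkpoint writes each dirty line once at the end of an interval, and that the address trace and function names are hypothetical.

```python
from typing import List


def writes_with_per_update_logging(trace: List[int]) -> int:
    """Assumed cost model: every store issues one log write and one data write."""
    return 2 * len(trace)


def writes_with_periodic_checkpoints(trace: List[int], interval: int) -> int:
    """Assumed cost model: at the end of each interval, every line dirtied
    during that interval is written to NVM exactly once."""
    total = 0
    for start in range(0, len(trace), interval):
        dirty_lines = set(trace[start:start + interval])
        total += len(dirty_lines)
    return total


# Hypothetical sequence of cache-line addresses touched by stores.
trace = [0, 1, 0, 0, 2, 1, 0, 3]

logging_writes = writes_with_per_update_logging(trace)
checkpoint_writes_short = writes_with_periodic_checkpoints(trace, interval=4)
checkpoint_writes_long = writes_with_periodic_checkpoints(trace, interval=8)

print(logging_writes, checkpoint_writes_short, checkpoint_writes_long)
```

In this model, repeated stores to the same line within an interval collapse into a single NVM write, so longer intervals write less. The cost is that a failure loses up to one interval of work, which is the relaxation of instant recovery that the abstract argues is acceptable for many workloads.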
Taxonomy
Topics: Distributed systems and fault tolerance · Parallel Computing and Optimization Techniques · Advanced Data Storage Technologies
