EasyCrash: Exploring Non-Volatility of Non-Volatile Memory for High Performance Computing Under Failures
Jie Ren, Kai Wu, Dong Li

TL;DR
EasyCrash is a framework that exploits the non-volatility of NVM used as main memory to improve HPC resilience. It selectively persists application data objects during execution so that more crashed applications can restart and recompute correctly, at a negligible performance overhead (1.5% on average).
Contribution
The paper introduces EasyCrash, a framework that decides which application data objects to persist in NVM-based HPC systems, improving fault tolerance and system efficiency under failures.
Findings
Transforms 54% of crashes that would otherwise fail to recompute correctly into correct recomputations.
Combined with applications' intrinsic fault tolerance, achieves an 82% successful recomputation rate.
Enables up to 24% system efficiency improvement when combined with traditional checkpointing.
Abstract
Emerging non-volatile memory (NVM) is promising for building future HPC. Leveraging the non-volatility of NVM as main memory, we can restart the application using data objects remaining on NVM when the application crashes. This paper explores this solution to handle HPC under failures, based on the observation that many HPC applications have good enough intrinsic fault tolerance. To improve the possibility of successful recomputation with correct outcomes and ignorable performance loss, we introduce EasyCrash, a framework to decide how to selectively persist application data objects during application execution. Our evaluation shows that EasyCrash transforms 54% of crashes that cannot correctly recompute into the correct computation while incurring a negligible performance overhead (1.5% on average). Using EasyCrash and application intrinsic fault tolerance, 82% of crashes can…
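The restart idea in the abstract can be illustrated with a toy simulation. This is a minimal sketch under stated assumptions, not the EasyCrash implementation: the class `SimulatedNVM`, the function `solve`, and the parameters `persist_every` and `crash_at` are hypothetical names invented for this example, and how EasyCrash actually chooses which objects to persist is not described in this summary. The sketch persists only the solution vector and the iteration counter of a Jacobi solver, simulates a crash, and restarts from whatever was persisted.

```python
import copy


class SimulatedNVM:
    """Hypothetical stand-in for NVM: only explicitly persisted objects survive a crash."""

    def __init__(self):
        self._durable = {}

    def persist(self, name, value):
        # Models flushing one data object from volatile caches to NVM.
        self._durable[name] = copy.deepcopy(value)

    def load(self, name, default=None):
        return copy.deepcopy(self._durable.get(name, default))


def jacobi_step(A, b, x):
    """One Jacobi sweep for the linear system A x = b."""
    n = len(b)
    return [
        (b[i] - sum(A[i][j] * x[j] for j in range(n) if j != i)) / A[i][i]
        for i in range(n)
    ]


def solve(A, b, nvm, persist_every=5, crash_at=None, max_iters=200, tol=1e-10):
    """Run Jacobi iterations, resuming from whatever state the NVM holds."""
    x = nvm.load("x", [0.0] * len(b))
    start = nvm.load("iter", 0)
    for it in range(start, max_iters):
        if crash_at is not None and it == crash_at:
            raise RuntimeError("simulated crash")
        x_new = jacobi_step(A, b, x)
        converged = max(abs(u - v) for u, v in zip(x_new, x)) < tol
        x = x_new
        if converged:
            return x, it + 1
        if (it + 1) % persist_every == 0:
            # Selective persistence: only the critical objects are flushed.
            nvm.persist("x", x)
            nvm.persist("iter", it + 1)
    return x, max_iters


# Diagonally dominant system whose exact solution is [1, 1, 1].
A = [[4.0, 1.0, 0.0], [1.0, 4.0, 1.0], [0.0, 1.0, 4.0]]
b = [5.0, 6.0, 5.0]

nvm = SimulatedNVM()
try:
    solve(A, b, nvm, crash_at=12)  # first run dies partway through
except RuntimeError:
    pass

x, iters = solve(A, b, nvm)  # restart resumes from the persisted state
print(x, iters)
```

In this toy, the restarted run resumes from the last persisted iterate instead of from zero, and the iterative method still converges to the correct answer. Persisting less often lowers the runtime cost but leaves a staler state to restart from, which is the kind of trade-off a selective-persistence framework has to weigh.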
Taxonomy
Topics: Parallel Computing and Optimization Techniques · Advanced Data Storage Technologies · Radiation Effects in Electronics
