Improving Performance of Iterative Methods by Lossy Checkpointing
Dingwen Tao, Sheng Di, Xin Liang, Zizhong Chen, Franck Cappello

TL;DR
This paper introduces a lossy checkpointing scheme for iterative methods that reduces overhead and improves performance in large-scale parallel scientific computations by leveraging lossy compression and theoretical analysis.
Contribution
It proposes a novel lossy checkpointing scheme, develops a performance model with bounds, analyzes its impact on various iterative methods, and demonstrates significant efficiency gains in HPC environments.
Findings
Reduces checkpointing overhead by up to 70%.
Achieves 20-58% performance improvement over lossless methods.
Validates effectiveness on a 2,048-core HPC system.
Abstract
Iterative methods are commonly used approaches to solve large, sparse linear systems, which are fundamental operations for many modern scientific simulations. When the large-scale iterative methods are running with a large number of ranks in parallel, they have to checkpoint the dynamic variables periodically in case of unavoidable fail-stop errors, requiring fast I/O systems and large storage space. To this end, significantly reducing the checkpointing overhead is critical to improving the overall performance of iterative methods. Our contribution is fourfold. (1) We propose a novel lossy checkpointing scheme that can significantly improve the checkpointing performance of iterative methods by leveraging lossy compressors. (2) We formulate a lossy checkpointing performance model and derive theoretically an upper bound for the extra number of iterations caused by the distortion of data…
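The idea in the abstract can be illustrated with a minimal sketch, which is not the authors' implementation. It assumes a small dense Jacobi solver and a toy uniform quantizer standing in for a real error-bounded lossy compressor. All names here (jacobi_step, lossy_compress, lossy_decompress, solve, error_bound) are hypothetical. The sketch checkpoints the iterate with a bounded absolute error, simulates a restart, and counts how many iterations are needed to converge from the exact state versus the lossy one.

```python
import math

def jacobi_step(A, b, x):
    """One Jacobi sweep for A x = b, with A stored as a dense list of rows."""
    n = len(b)
    return [
        (b[i] - sum(A[i][j] * x[j] for j in range(n) if j != i)) / A[i][i]
        for i in range(n)
    ]

def residual_norm(A, b, x):
    """Euclidean norm of the residual b - A x."""
    n = len(b)
    return math.sqrt(sum(
        (b[i] - sum(A[i][j] * x[j] for j in range(n))) ** 2
        for i in range(n)
    ))

def lossy_compress(x, error_bound):
    """Toy stand-in for a lossy compressor (assumption, not a real library):
    quantize each value to an integer code with absolute error <= error_bound."""
    step = 2.0 * error_bound
    return [round(v / step) for v in x]

def lossy_decompress(codes, error_bound):
    """Reconstruct approximate values from the integer codes."""
    step = 2.0 * error_bound
    return [c * step for c in codes]

def solve(A, b, x0, tol, max_iter=10_000):
    """Iterate from x0 until the residual norm drops to tol; return (x, iterations)."""
    x = list(x0)
    for k in range(max_iter):
        if residual_norm(A, b, x) <= tol:
            return x, k
        x = jacobi_step(A, b, x)
    return x, max_iter

# Small diagonally dominant tridiagonal system, so Jacobi converges.
n = 8
A = [[4.0 if i == j else (-1.0 if abs(i - j) == 1 else 0.0) for j in range(n)]
     for i in range(n)]
b = [1.0] * n
tol = 1e-10
error_bound = 1e-4

# Run a few iterations, then take a lossy checkpoint of the dynamic variable x.
x = [0.0] * n
for _ in range(10):
    x = jacobi_step(A, b, x)
checkpoint = lossy_compress(x, error_bound)

# Simulate a fail-stop error: restart from the decompressed checkpoint.
restored = lossy_decompress(checkpoint, error_bound)

x_exact, iters_exact = solve(A, b, x, tol)         # restart from exact state
x_lossy, iters_lossy = solve(A, b, restored, tol)  # restart from lossy state
print("iterations after exact restart:", iters_exact)
print("iterations after lossy restart:", iters_lossy)
```

Any difference between the two printed counts in this toy setting corresponds to the "extra number of iterations caused by the distortion of data" that the abstract says the paper bounds theoretically. The quantizer, the solver, and the problem size are illustrative choices only.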
