Algorithm-Based Checkpoint-Recovery for the Conjugate Gradient Method
Carlos Pachajoa, Christina Pacher, Markus Levonyak, Wilfried N. Gansterer

TL;DR
This paper introduces an algorithm-based checkpoint-recovery method called ESRP for the conjugate gradient solver, reducing overhead and improving resilience against node failures by exploiting algorithmic redundancies.
Contribution
The paper develops ESRP, a variant of ESR that stores redundant information only periodically. This lowers the runtime overhead at the cost of a higher recovery overhead after a node failure, and it integrates checkpoint-restart concepts into the PCG solver.
Findings
ESRP reduces overhead compared to ESR and in-memory CR.
ESRP has lower overhead than CR in failure-free scenarios.
CR outperforms ESRP during node failures.
Abstract
As computers reach exascale and beyond, the incidence of faults will increase. Solutions to this problem are an active research topic. We focus on strategies to make the preconditioned conjugate gradient (PCG) solver resilient against node failures, specifically, the exact state reconstruction (ESR) method, which exploits redundancies in PCG. Reducing the frequency at which redundant information is stored lessens the runtime overhead. However, after the node failure, the solver must restart from the last iteration for which redundant information was stored, which increases recovery overhead. This formulation highlights the method's similarities to checkpoint-restart (CR). Thus, this method, which we call ESR with periodic storage (ESRP), can be considered a form of algorithm-based checkpoint-restart. The state is stored implicitly, by exploiting redundancy inherent to the algorithm,…
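The trade-off described in the abstract (storing state less often lowers runtime overhead but means more iterations must be repeated after a failure) can be illustrated with a small sketch. This is not the authors' ESRP implementation: it uses plain, explicit in-memory checkpoints of an unpreconditioned CG state rather than the implicit, redundancy-based reconstruction that ESR relies on, and it simulates a failure by rolling back a single process. The function name `cg_with_periodic_checkpoint` and the parameters `period` and `fail_at` are hypothetical and chosen only for this illustration.

```python
def matvec(A, x):
    """Dense matrix-vector product for a list-of-lists matrix."""
    return [sum(a * xj for a, xj in zip(row, x)) for row in A]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def cg_with_periodic_checkpoint(A, b, period, fail_at=None,
                                tol=1e-10, max_iter=1000):
    """Unpreconditioned CG that saves its state every `period` iterations.

    If `fail_at` is given, a single simulated failure occurs at that
    iteration and the solver rolls back to the last saved state.
    Returns (x, executed_iterations, redone_iterations).
    """
    x = [0.0] * len(b)
    r = list(b)              # residual b - A x for the initial guess x = 0
    p = list(r)              # first search direction
    rs = dot(r, r)
    checkpoint = (0, list(x), list(r), list(p), rs)
    k = 0                    # logical iteration index
    executed = 0             # iterations actually computed, including redone ones
    redone = 0
    failed = False

    while rs ** 0.5 > tol and k < max_iter:
        if k % period == 0:
            # Explicit copy of the state (assumption: stands in for the
            # redundant information that ESRP stores periodically).
            checkpoint = (k, list(x), list(r), list(p), rs)

        if fail_at is not None and not failed and k == fail_at:
            failed = True
            redone = k - checkpoint[0]
            k = checkpoint[0]
            x, r, p = list(checkpoint[1]), list(checkpoint[2]), list(checkpoint[3])
            rs = checkpoint[4]
            continue

        Ap = matvec(A, p)
        alpha = rs / dot(p, Ap)
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, Ap)]
        rs_new = dot(r, r)
        beta = rs_new / rs
        p = [ri + beta * pi for ri, pi in zip(r, p)]
        rs = rs_new
        k += 1
        executed += 1

    return x, executed, redone


def laplacian_1d(n):
    """Symmetric positive definite tridiagonal test matrix (2 on the diagonal, -1 beside it)."""
    return [[2.0 if i == j else -1.0 if abs(i - j) == 1 else 0.0
             for j in range(n)] for i in range(n)]
```

With `period=5` and a failure at iteration 7, the solver returns to the state saved at iteration 5 and repeats two iterations. With `period=1` nothing is repeated, but the state is copied in every iteration, which is the runtime cost that a longer storage period reduces.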
