# How to Make the Preconditioned Conjugate Gradient Method Resilient Against Multiple Node Failures

**Authors:** Carlos Pachajoa, Markus Levonyak, Wilfried N. Gansterer, Jesper Larsson Träff

arXiv: 1907.13077 · 2019-08-21

## TL;DR

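This paper improves the fault tolerance of the parallel preconditioned conjugate gradient (PCG) solver by extending an exact state reconstruction (ESR) method so that it can recover from multiple simultaneous node failures. In experiments with three simultaneous failures, average runtime overheads range from 2.8% to 55.0%, depending on the sparsity pattern of the system matrix.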

## Contribution

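The paper refines how an ESR-based PCG solver stores redundant information across nodes, so that it can recover from simultaneous or overlapping failures of several nodes for general sparsity patterns of the system matrix, which Chen's original method cannot handle. No checkpointing or external storage of dynamic solver data is required.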

## Key findings

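- Average runtime overheads between 2.8% and 55.0% when recovering from three simultaneous node failures (128 nodes of the Vienna Scientific Cluster); the overhead depends on the sparsity pattern of the system matrix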
- Supports recovery from simultaneous or overlapping failures
- Effective on large sparse matrices from real-world applications
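
Recovery from simultaneous failures rests on a simple counting argument: if each node's data is held by the owner plus `phi` other nodes, then any `phi` simultaneous failures leave at least one copy alive. The sketch below checks this by brute force for a hypothetical ring placement, in which the next `phi` ranks hold the copies. The ring layout and the names `backup_nodes`, `survives`, and `tolerates_all` are assumptions made for illustration; they are not the placement strategy described in the paper.

```python
from itertools import combinations


def backup_nodes(owner, num_nodes, phi):
    """Hypothetical ring placement: the next `phi` ranks hold copies.

    Assumes phi < num_nodes so that all backup ranks differ from the owner.
    """
    return [(owner + k) % num_nodes for k in range(1, phi + 1)]


def survives(num_nodes, phi, failed):
    """True if every failed node still has a backup on a surviving node."""
    failed = set(failed)
    for owner in failed:
        backups = backup_nodes(owner, num_nodes, phi)
        if all(rank in failed for rank in backups):
            return False  # owner and all of its backups are gone
    return True


def tolerates_all(num_nodes, phi, num_failures):
    """Check every possible set of `num_failures` simultaneous failures."""
    return all(
        survives(num_nodes, phi, failed)
        for failed in combinations(range(num_nodes), num_failures)
    )


# Three copies per node tolerate three failures; two copies do not.
enough_copies = tolerates_all(num_nodes=8, phi=3, num_failures=3)
too_few_copies = tolerates_all(num_nodes=8, phi=2, num_failures=3)
```

With two copies per node, the failure set {0, 1, 2} wipes out node 0 together with both of its backups, which is why `too_few_copies` is false.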

## Abstract

We study algorithmic approaches for recovering from the failure of several compute nodes in the parallel preconditioned conjugate gradient (PCG) solver on large-scale parallel computers. In particular, we analyze and extend an exact state reconstruction (ESR) approach, which is based on a method proposed by Chen (2011). In the ESR approach, the solver keeps redundant information from previous search directions, so that the solver state can be fully reconstructed if a node fails unexpectedly. ESR does not require checkpointing or external storage for saving dynamic solver data and has low overhead compared to the failure-free situation.

In this paper, we improve the fault tolerance of the PCG algorithm based on the ESR approach. In particular, we support recovery from simultaneous or overlapping failures of several nodes for general sparsity patterns of the system matrix, which cannot be handled by Chen's method. For this purpose, we refine the strategy for how to store redundant information across nodes. We analyze and implement our new method and perform numerical experiments with large sparse matrices from real-world applications on 128 nodes of the Vienna Scientific Cluster (VSC). For recovering from three simultaneous node failures we observe average runtime overheads between only 2.8% and 55.0%. The overhead of the improved resilience depends on the sparsity pattern of the system matrix.
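
To make the idea of reconstructing solver state from search directions concrete, here is a minimal single-process sketch in NumPy. It is not the paper's implementation: it assumes a Jacobi (diagonal) preconditioner, a small dense matrix, and simulates a failure by overwriting some vector entries with NaN. The function names `pcg_step` and `reconstruct` are hypothetical. The reconstruction uses only the standard PCG relations `p_new = z + beta * p_old`, `z = r / diag(A)` (under the Jacobi assumption), and `r = b - A @ x`.

```python
import numpy as np


def pcg_step(A, d, x, r, z, p):
    """One PCG iteration with an assumed Jacobi preconditioner M = diag(d)."""
    Ap = A @ p
    rz = r @ z
    alpha = rz / (p @ Ap)
    x = x + alpha * p
    r = r - alpha * Ap
    z = r / d
    beta = (r @ z) / rz
    p_new = z + beta * p
    return x, r, z, p_new, p, beta  # the old p becomes the previous direction


def reconstruct(A, d, b, x, r, z, lost, p_cur_copy, p_prev_copy, beta):
    """Rebuild the lost entries of z, r, and x from two search directions."""
    keep = np.setdiff1d(np.arange(len(b)), lost)
    x, r, z = x.copy(), r.copy(), z.copy()
    z[lost] = p_cur_copy - beta * p_prev_copy  # invert p = z + beta * p_prev
    r[lost] = d[lost] * z[lost]                # invert z = r / d (Jacobi only)
    # Rows `lost` of r = b - A x give a small linear system for x[lost].
    rhs = b[lost] - r[lost] - A[np.ix_(lost, keep)] @ x[keep]
    x[lost] = np.linalg.solve(A[np.ix_(lost, lost)], rhs)
    return x, r, z


# Small symmetric positive definite test problem.
rng = np.random.default_rng(0)
n = 12
B = rng.standard_normal((n, n))
A = B @ B.T + n * np.eye(n)
b = rng.standard_normal(n)
d = np.diag(A).copy()

x = np.zeros(n)
r = b - A @ x
z = r / d
p = z.copy()
for _ in range(4):
    x, r, z, p, p_prev, beta = pcg_step(A, d, x, r, z, p)

# Simulate losing the entries owned by several failed nodes.
lost = np.array([3, 4, 5, 9])
x_dam, r_dam, z_dam = x.copy(), r.copy(), z.copy()
for v in (x_dam, r_dam, z_dam):
    v[lost] = np.nan

# p[lost] and p_prev[lost] stand in for the redundant copies held elsewhere.
x_rec, r_rec, z_rec = reconstruct(
    A, d, b, x_dam, r_dam, z_dam, lost, p[lost], p_prev[lost], beta
)
```

In this sketch the index set `lost` may span the entries of several nodes at once, so the same three relations cover multiple simultaneous failures, provided the copies of the two most recent search directions survive somewhere. With a non-diagonal preconditioner, the step that recovers `r[lost]` would itself require solving a linear system.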

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/1907.13077/full.md

## Figures

4 figures with captions in the complete paper: https://tomesphere.com/paper/1907.13077/full.md

## References

31 references — full list in the complete paper: https://tomesphere.com/paper/1907.13077/full.md

---
Source: https://tomesphere.com/paper/1907.13077