How to Make the Preconditioned Conjugate Gradient Method Resilient   Against Multiple Node Failures

Carlos Pachajoa; Markus Levonyak; Wilfried N. Gansterer; Jesper; Larsson Tr\"aff

arXiv:1907.13077·cs.DC·August 21, 2019

How to Make the Preconditioned Conjugate Gradient Method Resilient Against Multiple Node Failures

Carlos Pachajoa, Markus Levonyak, Wilfried N. Gansterer, Jesper, Larsson Tr\"aff

PDF

TL;DR

This paper enhances the fault tolerance of the parallel preconditioned conjugate gradient solver by extending an exact state reconstruction method to recover from multiple simultaneous node failures with minimal runtime overhead.

Contribution

It introduces a refined ESR-based approach supporting recovery from multiple overlapping node failures for general sparsity patterns, improving resilience without significant overhead.

Findings

01

Average runtime overheads between 2.8% and 55.0% for three node failures

02

Supports recovery from simultaneous or overlapping failures

03

Effective on large sparse matrices from real-world applications

Abstract

We study algorithmic approaches for recovering from the failure of several compute nodes in the parallel preconditioned conjugate gradient (PCG) solver on large-scale parallel computers. In particular, we analyze and extend an exact state reconstruction (ESR) approach, which is based on a method proposed by Chen (2011). In the ESR approach, the solver keeps redundant information from previous search directions, so that the solver state can be fully reconstructed if a node fails unexpectedly. ESR does not require checkpointing or external storage for saving dynamic solver data and has low overhead compared to the failure-free situation. In this paper, we improve the fault tolerance of the PCG algorithm based on the ESR approach. In particular, we support recovery from simultaneous or overlapping failures of several nodes for general sparsity patterns of the system matrix, which cannot…

Tables3

Table 1. Table 1. SPD matrices from (Davis and Hu, 2011 ) used in the experiments. n 𝑛 n : Problem size. N N Z 𝑁 𝑁 𝑍 NNZ : Number of non-zero entries. The matrices are ordered by increasing number of non-zero entries, with a larger ID indicating a larger number of non-zeros.

Name	Id	Problem type	$n$	$N N Z$
parabolic_fem	M1	Fluid dynamics	525 825	3 674 625
offshore	M2	Electromagnetics	259 789	4 242 673
G3_circuit	M3	Circuit simulation	1 585 478	7 660 826
thermal2	M4	Thermal	1 228 045	8 580 313
Emilia_923	M5	Structural	923 136	40 373 538
Geo_1438	M6	Structural	1 437 960	60 236 322
Serena	M7	Structural	1 391 349	64 131 971
audikw_1	M8	Structural	943 695	77 651 847

Table 2. Table 2. Summarized experimental results. Values are aggregated for experiments with different times (iteration numbers) when the failures are introduced. Matrices are ordered by descending reference runtime, with the ID number of a matrix growing with increasing number of non-zeros. t 0 subscript 𝑡 0 t_{0} : average reference time for the reference runs with regular (non-fault tolerant) PCG. ϕ italic-ϕ \phi : number of redundant copies for the search direction. ψ 𝜓 \psi : number of simultaneous node failures introduced. The column “relative overhead undisturbed” shows the mean overhead of modified ESR-capable PCG, with a given number of redundant copies ϕ italic-ϕ \phi , with respect to the reference time t 0 subscript 𝑡 0 t_{0} if no reconstruction takes place. The columns marked “relative reconstruction time” indicate the time that it takes to reconstruct the state of the solver, expressed as a percentage of the reference time t 0 subscript 𝑡 0 t_{0} , plus/minus a standard deviation. The columns marked “relative overhead with failures” show the mean, plus/minus a standard deviation, of the overall relative overhead, that is, the relative overhead for the total time until convergence, when node failures occur and the state of the solver is reconstructed, with respect to the reference runtime t 0 subscript 𝑡 0 t_{0} .

ID

t_{0}

[s]

Relative overhead undisturbed [%]

Failure location

Relative reconstruction time [%]

Overhead with failures [%]

ϕ = 1

ϕ = 3

ϕ = 8

ψ = ϕ = 1

ψ = ϕ = 3

ψ = ϕ = 8

ψ = ϕ = 1

ψ = ϕ = 3

ψ = ϕ = 8

M5

43.29

0.4

5.1

16.3

start

0.2

p m 0.1

0.3

p m 0.1

0.6

p m 0.1

1.2

p m 0.5

6.0

p m 0.3

19.6

p m 0.7

center

0.0

p m 0.1

0.1

p m 0.1

0.4

p m 0.1

1.2

p m 0.6

5.6

p m 0.3

20.5

p m 6.0

M8

19.12

1.5

2.2

8.2

start

1.4

p m 0.1

3.5

p m 0.2

10.7

p m 0.2

3.8

p m 1.0

5.6

p m 0.7

19.9

p m 0.9

center

0.4

p m 0.1

0.8

p m 0.1

1.4

p m 0.1

1.9

p m 0.4

2.8

p m 1.0

10.4

p m 1.1

M6

11.70

0.3

3.1

15.1

start

1.3

p m 0.1

2.4

p m 0.2

5.3

p m 0.2

1.4

p m 0.3

5.9

p m 0.4

21.2

p m 1.2

center

0.3

p m 0.1

0.6

p m 0.1

1.6

p m 0.1

0.7

p m 0.4

4.0

p m 0.3

18.3

p m 0.7

M7

6.48

0.2

4.3

13.6

start

2.6

p m 0.1

8.2

p m 0.3

23.4

p m 0.6

2.9

p m 0.3

13.7

p m 0.6

39.6

p m 1.4

center

1.2

p m 0.1

1.5

p m 0.1

21.1

p m 0.7

1.2

p m 0.2

7.1

p m 0.6

36.7

p m 1.7

M4

6.31

8.2

22.8

65.6

start

2.0

p m 0.1

3.1

p m 0.1

4.9

p m 0.3

9.6

p m 0.8

26.4

p m 1.1

66.0

p m 7.9

center

1.3

p m 0.1

3.8

p m 0.2

5.2

p m 0.4

9.2

p m 2.6

35.7

p m 2.2

63.4

p m 9.3

M1

2.25

3.6

5.1

24.5

start

0.3

p m 0.1

0.3

p m 0.1

0.4

p m 0.1

5.1

p m 1.2

4.3

p m 0.9

23.8

p m 2.1

center

0.3

p m 0.1

0.3

p m 0.1

0.4

p m 0.1

5.0

p m 1.0

5.1

p m 1.3

25.4

p m 1.8

M2

1.58

8.0

8.1

21.1

start

4.0

p m 0.2

7.0

p m 0.3

9.7

p m 0.4

7.3

p m 4.6

15.6

p m 2.3

30.1

p m 5.5

center

2.1

p m 0.3

3.5

p m 0.2

4.4

p m 0.3

10.0

p m 4.6

14.2

p m 2.6

23.8

p m 3.0

M3

1.16

5.0

24.1

91.3

start

0.6

p m 0.1

1.0

p m 0.1

1.9

p m 0.1

6.3

p m 1.7

24.8

p m 1.8

93.6

p m 6.0

center

8.0

p m 0.5

32.2

p m 1.2

58.1

p m 1.4

14.8

p m 2.0

55.0

p m 3.2

147.6

p m 5.0

Table 3. Table 3. Evaluation of the metric of Eqn. ( 7 ). The first column shows the maximum value of the relative residual deviation for all experiments with node failures for a matrix. The second column shows the relative residual deviation for the reference run.

ID	max $Δ_{ESR}$	$Δ_{PCG}$
M1	$3.46 × 10^{- 7}$	$- 1.63 × 10^{- 7}$
M2	$2.24 × 10^{- 7}$	$1.86 × 10^{- 7}$
M3	$1.57 × 10^{- 7}$	$1.57 × 10^{- 7}$
M4	$1.96 × 10^{- 7}$	$9.06 × 10^{- 8}$
M5	$3.59 × 10^{- 5}$	$1.81 × 10^{- 6}$
M6	$8.80 × 10^{- 8}$	$- 3.31 × 10^{- 8}$
M7	$1.32 × 10^{- 7}$	$- 7.24 × 10^{- 8}$
M8	$2.64 × 10^{- 3}$	$1.49 × 10^{- 3}$

Equations28

\IfStrEq j u_{I_{1}} u_{I_{1}}^{(j)}

\IfStrEq j u_{I_{1}} u_{I_{1}}^{(j)}

\IfStrEq j u_{I_{2}} u_{I_{2}}^{(j)}

\makebox [0.0 pt] ⋮

\IfStrEq j u_{I_{N}} u_{I_{N}}^{(j)}

S_{i} : = \leavevmode

S_{i} : = \leavevmode

S_{ik} : = \leavevmode

R_{i} : = \leavevmode

m_{i} : S_{i} \to \leavevmode

m_{i} : S_{i} \to \leavevmode

s \mapsto \leavevmode

during the computation of \leavevmode A \IfStrEq j p p^{(j)} .

R_{i}^{c} = {s \in S_{i} ∣ m_{i} (s) = 0} .

R_{i}^{c} = {s \in S_{i} ∣ m_{i} (s) = 0} .

d_{ik}\coloneqq\left\{\begin{array}[]{ll}\left(i+\left\lceil\frac{k}{2}\right\rceil\right)\bmod N,&\text{if}\leavevmode\nobreak\ k\leavevmode\nobreak\ \text{odd}\\ \left(i-\frac{k}{2}\right)\bmod N,&\text{if}\leavevmode\nobreak\ k\leavevmode\nobreak\ \text{even}\end{array}\right.

d_{ik}\coloneqq\left\{\begin{array}[]{ll}\left(i+\left\lceil\frac{k}{2}\right\rceil\right)\bmod N,&\text{if}\leavevmode\nobreak\ k\leavevmode\nobreak\ \text{odd}\\ \left(i-\frac{k}{2}\right)\bmod N,&\text{if}\leavevmode\nobreak\ k\leavevmode\nobreak\ \text{even}\end{array}\right.

R_{ik}^{c} : = {s \in S_{i} ∣ s \in / S_{i d_{ik}} \land m_{i} (s) - g_{i} (s) \leq ϕ - k}

R_{ik}^{c} : = {s \in S_{i} ∣ s \in / S_{i d_{ik}} \land m_{i} (s) - g_{i} (s) \leq ϕ - k}

0

0

\leq i max (λ_{ik} + ∣ R_{ik}^{c} ∣ μ) \leq i max λ_{ik} + i max ∣ R_{ik}^{c} ∣ μ,

0

0

\leq k = 1 \sum ϕ i max (λ_{ik} + ∣ R_{ik}^{c} ∣ μ) \leq k = 1 \sum ϕ (i max λ_{ik} + i max ∣ R_{ik}^{c} ∣ μ)

= k = 1 \sum ϕ \leq λ_{m a x} i max λ_{ik} + k = 1 \sum ϕ \leq ⌈ \frac{n}{N} ⌉ i max ∣ R_{ik}^{c} ∣ μ \leq ϕ λ_{m a x} + ϕ ⌈ \frac{n}{N} ⌉ μ,

Δ_{\mbox E S R} Δ_{\mbox P C G} = \frac{∥ r _{\mbox E S R} ∥ _{2} - ∥ \IfStrEq b b ^{()} - A x _{\mbox E S R} ∥ _{2}}{∥ \IfStrEq b b ^{()} - A x _{\mbox E S R} ∥ _{2}} = \frac{∥ r _{\mbox P C G} ∥ _{2} - ∥ \IfStrEq b b ^{()} - A x _{\mbox P C G} ∥ _{2}}{∥ \IfStrEq b b ^{()} - A x _{\mbox P C G} ∥ _{2}} .

Δ_{\mbox E S R} Δ_{\mbox P C G} = \frac{∥ r _{\mbox E S R} ∥ _{2} - ∥ \IfStrEq b b ^{()} - A x _{\mbox E S R} ∥ _{2}}{∥ \IfStrEq b b ^{()} - A x _{\mbox E S R} ∥ _{2}} = \frac{∥ r _{\mbox P C G} ∥ _{2} - ∥ \IfStrEq b b ^{()} - A x _{\mbox P C G} ∥ _{2}}{∥ \IfStrEq b b ^{()} - A x _{\mbox P C G} ∥ _{2}} .

Peer Reviews

No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.

Videos

No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.

Full text

How to Make the Preconditioned Conjugate Gradient Method Resilient

Against Multiple Node Failures

Carlos Pachajoa

0000-0002-0688-8130

University of ViennaFaculty of Computer ScienceViennaAustria

[email protected]

,

Markus Levonyak

0000-0002-5131-6318

University of ViennaFaculty of Computer ScienceViennaAustria

[email protected]

,

Wilfried N. Gansterer

University of ViennaFaculty of Computer ScienceViennaAustria

[email protected]

and

Jesper Larsson Träff

TU WienFaculty of InformaticsViennaAustria

[email protected]

(2019)

Abstract.

We study algorithmic approaches for recovering from the failure of several compute nodes in the parallel preconditioned conjugate gradient (PCG) solver on large-scale parallel computers. In particular, we analyze and extend an exact state reconstruction (ESR) approach, which is based on a method proposed by Chen (2011). In the ESR approach, the solver keeps redundant information from previous search directions, so that the solver state can be fully reconstructed if a node fails unexpectedly. ESR does not require checkpointing or external storage for saving dynamic solver data and has low overhead compared to the failure-free situation.

In this paper, we improve the fault tolerance of the PCG algorithm based on the ESR approach. In particular, we support recovery from simultaneous or overlapping failures of several nodes for general sparsity patterns of the system matrix, which cannot be handled by Chen’s method. For this purpose, we refine the strategy for how to store redundant information across nodes. We analyze and implement our new method and perform numerical experiments with large sparse matrices from real-world applications on 128 nodes of the Vienna Scientific Cluster (VSC). For recovering from three simultaneous node failures we observe average runtime overheads between only 2.8% and 55.0%. The overhead of the improved resilience depends on the sparsity pattern of the system matrix.

preconditioned conjugate gradient method, extreme-scale parallel computing, fail-stop failures, multiple node failures, resilience, algorithmic fault tolerance

††ccs: Mathematics of computing Solvers††ccs: Mathematics of computing Mathematical software performance††ccs: Computing methodologies Parallel algorithms††journalyear: 2019††copyright: acmlicensed††conference: 48th International Conference on Parallel Processing; August 5–8, 2019; Kyoto, Japan††price: 15.00††doi: 10.1145/3337821.3337849††isbn: 978-1-4503-6295-5/19/08

1. Introduction

One of the major challenges in current and, even more, in future high-performance computing (HPC) is the increasing failure rate caused by increasing complexity and an ever-growing number of interconnected components (Snir et al., 2014; Gupta et al., 2017). Among the most common types of failures in large-scale parallel computers are fail-stop failures, where a failing process stops and its data is lost (Schlichting and Schneider, 1983; Elnozahy et al., 2002; Schroeder and Gibson, 2010; Chen, 2011). In order to support long-running scientific applications also on failure-prone HPC systems, new strategies and resilient algorithms are necessary (Herault and Robert, 2015).

We consider the preconditioned conjugate gradient (PCG) method (cf. Sec. 2.1), an important iterative method for solving symmetric and positive-definite (SPD) sparse linear systems on large-scale parallel computers. We study settings where the PCG solver is executed on parallel computers that are susceptible to node failures, a common type of fail-stop failures which is critical in practice (Herault and Robert, 2015). In contrast to previous work that also aims to make the PCG method resilient against unexpected node failures without expensive checkpointing (cf. Sec. 1.2), we do not only consider single node failures but multiple node failures which occur simultaneously or are overlapping in time. The exact state reconstruction (ESR) approach (cf. Sec. 2.2), which was introduced by Chen (Chen, 2011) and refined by Pachajoa et al. (Pachajoa et al., 2018), was shown to be the most efficient checkpointing-free algorithmic fault-tolerance technique for protecting the PCG solver against unexpected single node failures (Pachajoa et al., 2018).

In this paper, we assume that a large sparse linear system $\bm{A}\bm{x}=\IfStrEq{}{}{\bm{b}}{\bm{b}^{()}}$ , where $\bm{A}$ is SPD, is given and shall be solved with the PCG method on a parallel computer. The preconditioner $\bm{M}$ for solving this linear system is either explicitly or implicitly given. We enhance the ESR approach so that it becomes capable of protecting the PCG method against unexpected simultaneous or overlapping failures of multiple nodes, theoretically analyze the communication overhead for this improved resilience, and demonstrate low runtime overheads in numerical experiments. Although we cannot provide details due to space restrictions, our proposed algorithmic modifications can also be applied to the ESR approach (Chen, 2011) for the Jacobi, Gauss-Seidel, successive overrelaxation (SOR), symmetric successive overrelaxation (SSOR), split preconditioner conjugate gradient (SPCG) (Pachajoa et al., 2018) and preconditioned bi-conjugate gradient stabilized (BiCGSTAB) algorithms in order to make them resilient against multiple simultaneous or overlapping node failures.

1.1. Problem setting and assumptions

We consider the parallel execution of the PCG solver on $N$ compute nodes of a distributed-memory parallel computer, which communicate over an interconnection network. Each compute node consists of $m$ processors such that the solver is executed on $M\coloneqq N\times m$ processors in total. Each processor shares its memory with all other processors on the same compute node.

In the event of a single processor failure, exactly one processor on one compute node fails, but the shared memory on this node stays intact. Hence, no data is lost and the remaining processors on the node can take over the workload of the failed processor. The same holds for multiple processor failures where at least one processor per compute node survives. In this paper, we consider a more complicated event: a node failure, where a compute node fails as a whole and the data in the memory of the affected node is lost.

Node failures may occur for several reasons: for example, all $m$ processors of a node fail, the shared memory of a node gets corrupted, or a node loses its connection to the interconnection network. In case of a single node failure, exactly one compute node fails at a time. If more than one node fails at a time, we talk about multiple node failures. Nodes that continue working after a node failure and keep all their data are called surviving nodes. A node that becomes unavailable after a node failure is referred to as a failed node, and a node that takes the place of a failed node in the recovery process is called a replacement node (which could be a spare node or one of the surviving nodes).

1.1.1. Failure detection and node replacement

We assume the availability of a parallel runtime environment that provides functionality comparable to state-of-the-art implementations of the industry-standard Message Passing Interface (MPI) (Message Passing Interface Forum, 2015). This particularly comprises efficient collective communication capabilities like the Allreduce function of MPI. Beyond that, we assume that the underlying runtime environment supports some basic fault-tolerance features. A prototypical example is the MPI extension User Level Failure Mitigation (ULFM) (Bland et al., 2013; Message Passing Interface Forum, 2017) which supports the detection of node failures, the prevention of indefinitely blocking synchronizations or communications, the notification of the surviving nodes that a failure has occurred and which nodes have failed, and a mechanism for providing replacement nodes.

1.1.2. Data distribution

Analogously to (Chen, 2011), we assume that the fundamental problem-defining static input data, i.e., the system matrix $\bm{A}\in\mathbb{R}^{n\times n}$ , the right-hand-side vector $\IfStrEq{}{}{\bm{b}}{\bm{b}^{()}}\in\mathbb{R}^{n}$ , and the preconditioner $\bm{M}\in\mathbb{R}^{n\times n}$ (cf. Sec. 2.1), can be retrieved from reliable external storage (e.g., from a checkpoint prior to entering the linear solver, cf. the ABFT&PeriodicCkpt algorithm (Bosilca et al., 2014, 2015; Herault and Robert, 2015), see Sec. 1.2), and thus does not have to be reconstructed after a node failure.

In accordance with typical distributions of sparse matrices in widely used high-performance numerical libraries like PETSc (Balay et al., 1997), we consider a block-row data distribution of all matrices and vectors among the $N$ nodes of the parallel computer. More precisely, every node owns blocks of $n/N$ contiguous rows (if $n=cN$ with $c\in\mathbb{N}$ , otherwise some nodes own $\lfloor n/N\rfloor$ and others $\lceil n/N\rceil$ rows) of both the matrices $\bm{A}$ and $\bm{M}$ as well as each of the vectors \IfStrEq $\bm{b}$$\bm{b}^{()}$ , $\bm{x}$ , and all other vectors maintained by the PCG solver (cf. Sec. 2.1). On one node, the owned block rows, which are stored in the shared memory of the node, are evenly distributed among the $m$ processors, i.e., each processor owns (approximately) $n/M$ rows of each of the matrices and vectors. Globally used scalars are replicated on all $N$ nodes.

Since each node owns block rows of all matrices and vectors, a node failure affects a part of each matrix and vector. Block rows of dynamic data which were owned by the failed node are lost and need to be reconstructed on the replacement node (cf. Sec. 2.2).

1.1.3. Notation

We use a notation similar to (Agullo et al., 2016) to denote sections of matrices and vectors. We refer to the set of all indices as $I\coloneqq\{1,2,\dots,n\}$ . The cardinality $n$ of $I$ is equal to the size of the vectors. The index subset representing the rows assigned to node $i$ is denoted as $I_{i}$ . Given a vector \IfStrEqj $\bm{v}$$\bm{v}^{(j)}$ , where $j$ denotes the iteration number of the linear solver, \IfStrEqj $\bm{v}_{I_{i}}$$\bm{v}_{I_{i}}^{(j)}$ refers to the subset of elements of the vector at iteration $j$ owned by node $i$ . Row and column selections of a matrix $\bm{B}$ are designated with index sets as well: $\bm{B}_{I_{i},I_{k}}$ refers to the selection of rows and columns of $\bm{B}$ corresponding to the index subsets $I_{i}$ and $I_{k}$ , respectively.

The failed node is referred to as node $f$ , and its index set is hence denoted as $I_{f}$ . With the state of an iterative solver we mean the—not necessarily minimal—set of data that completely defines the future behavior of this iterative solver. The state of the PCG solver in iteration $j$ can be defined as comprising the iterate \IfStrEqj $\bm{x}$$\bm{x}^{(j)}$ (i.e., the current approximation to the solution $\bm{x}$ ), the residual \IfStrEqj $\bm{r}$$\bm{r}^{(j)}$ , the preconditioned residual \IfStrEqj $\bm{z}$$\bm{z}^{(j)}$ , and the search direction \IfStrEqj $\bm{p}$$\bm{p}^{(j)}$ .

1.2. Related work

The existing literature on resilient iterative linear solvers distinguishes between two different kinds of failures: soft errors and node failures. The former refers to spontaneous changes of the state of the solver (e.g., bit flips), potentially leading to a wrong result. In this category, we find work by Sao and Vuduc (Sao and Vuduc, 2013) which proposes strategies to ensure that the conjugate gradient (CG) method will converge to the right solution after a soft error. However, their approach requires some operations to be performed reliably.

Bronevetsky and de Supinski (Bronevetsky and de Supinski, 2008) evaluate the effects of soft errors on iterative linear solvers including the CG method. Bit flips are introduced at random times and positions, and the effects are classified according to the resulting runtimes and solution errors. Dichev and Nikolopoulos (Dichev and Nikolopoulos, 2016) propose and experimentally evaluate a specific form of dual modular redundancy, where all computations are performed twice for improved redundancy, in order to detect and correct soft errors in the PCG method. More work in the area of soft error detection and correction in the CG and PCG methods has been published by Shantharam et al. (Shantharam et al., 2012) and Fasi et al. (Fasi et al., 2016). All these approaches have in common that they are not applicable to our problem of the PCG solver subject to node failures.

Pachajoa and Gansterer (Pachajoa and Gansterer, 2018) experimentally evaluate the inherent resilience of the CG method after a single node failure. The currently in practice most commonly used class of fault-tolerance techniques to cope with node failures is checkpoint/restart (C/R). These techniques frequently save the current state of a running application and roll back to the latest saved state in case of a node failure. C/R has been investigated as a general-purpose technique (e.g. (Tiwari et al., 2014)), and also for specific problem settings (e.g. (Ltaief et al., 2008)). Herault et al. (Herault and Robert, 2015) provide a comprehensive overview of different variants of this approach.

To avoid the overhead of continuously saving the state, Chen (Chen, 2011) exploits specific properties of the PCG solver and other iterative methods such that the state of the solver can be recovered without checkpointing after a single node failure occurred. For storing the necessary redundant information, he chooses the intuitive approach of sending it to the closest node (cf. Sec. 3). Pachajoa et al. (Pachajoa et al., 2018) refine Chen’s approach (Chen, 2011) by distinguishing different common types of preconditioners.

In (Pachajoa et al., 2018), the approach from (Chen, 2011) is experimentally compared to a heuristic strategy proposed by Langou et al. (Langou et al., 2007). This heuristic interpolation/restart strategy is applicable to iterative linear solvers in general to recover from a node failure by approximating the lost iterate. The submatrix of the replacement node is used to produce an interpolated approximation of the iterate before the node failure occurred. Agullo et al. (Agullo et al., 2016, 2013) extend this approach by using all the information of the matrix in the interpolation, which produces a reconstructed iterate whose error norm is guaranteed to be smaller than the error norm of the iterate before the node failure, albeit with significant communication overhead.

Bosilca et al. (Bosilca et al., 2014, 2015; Herault and Robert, 2015) introduce the ABFT&PeriodicCkpt algorithm, which combines algorithm-based fault tolerance (ABFT) with periodic checkpointing in order to make entire applications (and not just operations that can be protected by ABFT) resilient to node failures. The longer the phases protected by ABFT, the fewer checkpoints are necessary and the cheaper resilience against node failures usually becomes (assuming that the used ABFT method is more efficient than checkpointing).

1.3. Contributions of this work

To the best of our knowledge, there has been neither a detailed discussion nor a thorough analysis of a generalized ESR approach for protecting the PCG method against multiple node failures. In this paper, we propose a strategy that precisely defines how and where to store redundant information so that the PCG solver becomes resilient against multiple node failures. In particular, we analyze the communication overhead and evaluate the performance penalty for the improved resilience capabilities of being able to tolerate multiple simultaneous node failures. In numerical experiments on the Vienna Scientific Cluster (VSC), a medium-scale HPC system, we eventually demonstrate the low runtime overhead of our new strategies for the improved resilience.

The remainder of this paper is organized as follows. In Sec. 2, we briefly review the PCG method and the ESR approach as presented in (Chen, 2011; Pachajoa et al., 2018). Afterwards, in Sec. 3, we summarize the concept of Chen (Chen, 2011) for keeping redundant information in the ESR approach such that single node failures can be tolerated. Then, in Sec. 4, we propose modifications and extensions of the ESR approach for systems that are prone to multiple simultaneous or overlapping node failures. Apart from that, we theoretically analyze the communication overhead of our novel algorithm for supporting multiple node failures. In Sec. 5, we discuss the most important aspects regarding the impact of the sparsity pattern of the system matrix. Next, in Sec. 6, we summarize our implementation for conducting numerical experiments. Subsequently, in Sec. 7, we present our experimental results and discuss how they relate to our theoretical analysis. Finally, our conclusions are summarized in Sec. 8.

2. Algorithmic background

In this section, we first review the PCG method in Sec. 2.1 and afterwards the ESR approach (Chen, 2011; Pachajoa et al., 2018) in Sec. 2.2. In Sec. 3, we then look at the details of how Chen (Chen, 2011) is keeping redundant information in order to guarantee protection against single node failures.

2.1. Preconditioned conjugate gradient method

The (P)CG method iteratively solves a linear system $\bm{A}\bm{x}=\IfStrEq{}{}{\bm{b}}{\bm{b}^{()}}$ , where both the SPD matrix $\bm{A}\in\mathbb{R}^{n\times n}$ and the right-hand-side vector $\IfStrEq{}{}{\bm{b}}{\bm{b}^{()}}\in\mathbb{R}^{n}$ are given. The search directions \IfStrEqj $\bm{p}$$\bm{p}^{(j)}$ , along which the quadratic potential defined by $\bm{A}$ is minimized, are chosen to be $\bm{A}$ -orthogonal, i.e., $\bm{p}^{(j)\top}\bm{A}\IfStrEq{k}{}{\bm{p}}{\bm{p}^{(k)}}=0$ for all $j\neq k$ . The residual \IfStrEqj $\bm{r}$$\bm{r}^{(j)}$ is defined as $\IfStrEq{j}{}{\bm{r}}{\bm{r}^{(j)}}=\IfStrEq{}{}{\bm{b}}{\bm{b}^{()}}-\bm{A}\IfStrEq{j}{}{\bm{x}}{\bm{x}^{(j)}}$ .

In the PCG method, which is listed in Alg. 1, a preconditioner $\bm{M}\in\mathbb{R}^{n\times n}$ is used for accelerating convergence. Instead of the original system, the linear system $\bm{M}^{-1}\bm{A}\bm{x}=\bm{M}^{-1}\IfStrEq{}{}{\bm{b}}{\bm{b}^{()}}$ is solved. $\bm{M}$ is assumed to be an SPD matrix as well (Saad, 2003, p. 276) and is to be chosen such that $\kappa(\bm{M}^{-1}\bm{A})<\kappa(\bm{A})$ , where $\kappa(\bm{B})$ denotes the condition number of a matrix $\bm{B}$ . The PCG method stores the distributed vectors \IfStrEqj $\bm{x}$$\bm{x}^{(j)}$ , \IfStrEqj $\bm{r}$$\bm{r}^{(j)}$ , \IfStrEqj $\bm{z}$$\bm{z}^{(j)}$ , \IfStrEqj $\bm{p}$$\bm{p}^{(j)}$ , and $\bm{A}\IfStrEq{j}{}{\bm{p}}{\bm{p}^{(j)}}$ as well as the replicated scalars $\alpha^{(j)}$ and $\beta^{(j)}$ , which in total require memory for $5n+2N$ floating-point numbers (not including the static data $\bm{A}$ , $\bm{M}$ , and \IfStrEq $\bm{b}$$\bm{b}^{()}$ ).

2.2. Exact state reconstruction (ESR)

In contrast to checkpoint/restart methods, which—even in the failure-free case—impose a usually considerable runtime overhead due to continuously saving the state of the solver (Herault and Robert, 2015), the ESR approach (Chen, 2011; Pachajoa et al., 2018) is able to exploit the algorithmic properties of the PCG solver so that the complete state can be reconstructed after a node failure with low overhead. During the failure-free PCG iterations, only little or—depending on the sparsity structure of $\bm{A}$ —no additional communication is necessary for achieving this. Only the local memory requirements are slightly higher than in the standard (non-resilient) PCG variant.

At iteration $j$ of the PCG algorithm, the product $\bm{A}\IfStrEq{j}{}{\bm{p}}{\bm{p}^{(j)}}$ is computed in a sparse matrix-vector multiplication (SpMV) operation (cf. Alg. 1, lines 3 and 5). In the non-resilient PCG solver, all but the own block \IfStrEqj $\bm{p}_{I_{i}}$$\bm{p}_{I_{i}}^{(j)}$ can be dropped on node $i$ after the product has been computed. In the resilient algorithm, we also drop most of \IfStrEqj $\bm{p}$$\bm{p}^{(j)}$ on each node after the SpMV. The crucial difference to non-resilient PCG is that now each node also stores the elements of at least one other node in addition to its own block (for details see Sec. 3).

Overall, there is a redundant copy of each element of \IfStrEqj $\bm{p}$$\bm{p}^{(j)}$ after computing $\bm{A}\IfStrEq{j}{}{\bm{p}}{\bm{p}^{(j)}}$ . Hence, in case of a node failure, this redundant copy can be sent to the replacement node in order to completely recover the most recent search direction. For the reconstruction of the complete state after a node failure, we need to recover the two most recent search directions (Chen, 2011, Sec. 5.2) and thus have to keep redundant copies of each element of not only \IfStrEqj $\bm{p}$$\bm{p}^{(j)}$ but also \IfStrEqj-1 $\bm{p}$$\bm{p}^{(j-1)}$ . The local memory overhead of $2n/N$ vector elements per node and $2n$ vector elements in total is negligible compared to the overall memory requirement $\mathcal{O}(n^{2})$ of the PCG solver (cf. Sec. 2.1).

The procedure to reconstruct the complete state of the PCG solver is outlined in Alg. 2. It is assumed that a preconditioner $\bm{P}\coloneqq\bm{M}^{-1}$ is given. Variants for cases where $\bm{M}$ (not $\bm{M}^{-1}$ ) or a split preconditioner $\bm{M}=\bm{L}\bm{L}^{\top}$ is given are shown in (Pachajoa et al., 2018, Alg. 3 and 5). The scalar $\beta^{(j-1)}$ can easily be recovered since it is replicated on every node (cf. Sec. 2.1), i.e., $\beta^{(j-1)}$ has the same value on all nodes. Furthermore, the reconstruction of the lost parts \IfStrEqj $\bm{r}_{I_{f}}$$\bm{r}_{I_{f}}^{(j)}$ and \IfStrEqj $\bm{z}_{I_{f}}$$\bm{z}_{I_{f}}^{(j)}$ of the (preconditioned) residuals \IfStrEqj $\bm{r}$$\bm{r}^{(j)}$ and \IfStrEqj $\bm{z}$$\bm{z}^{(j)}$ takes place on the replacement node $f$ . Matrix $\bm{A}_{I_{f},I_{f}}$ of the linear system in line 8 of Alg. 2 is SPD, has full rank, and is much smaller than the full system matrix $\bm{A}$ . Therefore, this linear system can be solved locally on the replacement node $f$ (only the involved vectors need to be gathered first on node $f$ ). Detailed derivations of the ESR approach can be found in (Chen, 2011; Pachajoa et al., 2018).

3. Single node failure

We now discuss details of how and where to store the redundant copies of the two most recent search directions. For this purpose, we review the strategy proposed by Chen (Chen, 2011) for protecting the PCG method against a single node failure. However, as we will see, this strategy is not suitable for multiple simultaneous or overlapping node failures. Later, in Sec. 4, we present our new strategy for handling multiple simultaneous or overlapping node failures.

In PCG, the SpMV operation for obtaining $\IfStrEq{j}{}{\bm{u}}{\bm{u}^{(j)}}=\bm{A}\IfStrEq{j}{}{\bm{p}}{\bm{p}^{(j)}}$ , which can be rewritten as

[TABLE]

is performed at each iteration (cf. lines 3 and 5 of Alg. 1). When ignoring possible optimizations due to the sparsity pattern of $\bm{A}$ , \IfStrEqj $\bm{p}_{I_{i}}$$\bm{p}_{I_{i}}^{(j)}$ is sent from node $i$ to node $k$ in iteration $j$ such that $\IfStrEq{j}{}{\bm{u}_{I_{k}}}{\bm{u}_{I_{k}}^{(j)}}=\bm{A}_{I_{k},I_{1}}\IfStrEq{j}{}{\bm{p}_{I_{1}}}{\bm{p}_{I_{1}}^{(j)}}+\dots+\bm{A}_{I_{k},I_{i}}\IfStrEq{j}{}{\bm{p}_{I_{i}}}{\bm{p}_{I_{i}}^{(j)}}+\dots+\bm{A}_{I_{k},I_{N}}\IfStrEq{j}{}{\bm{p}_{I_{N}}}{\bm{p}_{I_{N}}^{(j)}}$ can be computed on node $k$ . For a more optimized algorithm that, during SpMV, only sends the minimum set of elements required due to the sparsity pattern of $\bm{A}$ , we define

[TABLE]

in accordance with Chen (Chen, 2011). \IfStrEqj $\bm{p}_{I_{i}}$$\bm{p}_{I_{i}}^{(j)}$ can be completely recovered after a node failure if $R_{i}=S_{i}$ . In order to ensure this, Chen proposes to send $R_{i}^{c}$ to node $d_{i}\coloneqq(i+1)\bmod N$ (together with $S_{id_{i}}$ ).

Unfortunately, this strategy is not capable of coping with simultaneous or overlapping failures of multiple nodes. For example, if both nodes $i$ and $i+1$ fail simultaneously and $R_{i}^{c}\neq\emptyset$ , the search direction vector elements in $R_{i}^{c}$ are lost and the state of the PCG solver cannot be reconstructed. The problem obviously worsens if more than two nodes fail simultaneously.

4. Multiple node failures

In this section, we extend the ESR approach for coping with multiple simultaneous or overlapping node failures. Originally, the ESR method considers exactly one node failure at a time, i.e., it is assumed that the reconstruction process finishes before another node failure occurs (cf. Sec. 2.2 and Sec. 3). For coping with up to $\phi<N$ uniformly distributed node failures that may overlap in time, we need to keep $\phi$ redundant copies of each block of the two most recent search directions \IfStrEqj-1 $\bm{p}$$\bm{p}^{(j-1)}$ and \IfStrEqj $\bm{p}$$\bm{p}^{(j)}$ on $\phi$ different compute nodes (Pachajoa et al., 2018). In the following, we design a resilient algorithm based on this idea in detail and analyze its communication overhead.

4.1. Tolerating multiple node failures

To keep $\phi$ redundant copies of each block of the two most recent search direction vectors, a similar strategy can be pursued as in the special case $\phi=1$ . Let $(S_{i},m_{i})$ be a multiset with the multiplicity

[TABLE]

Note that we assume here—as it is common for SpMV—that $s$ is only sent to nodes with corresponding non-zero entries in their rows of $\bm{A}$ . Hence, the number of nodes $s$ is sent to depends on the sparsity pattern of $\bm{A}$ . Comparing the definition of the multiplicity $m$ in Eqn. (3) with the definition of $R_{i}^{c}$ in Eqn. (2), it follows that

[TABLE]

For supporting up to $\phi$ simultaneous (or overlapping) node failures, we need to store at least $\phi$ redundant copies of each element of \IfStrEqj $\bm{p}_{I_{i}}$$\bm{p}_{I_{i}}^{(j)}$ , $i\in\{1,2,\dots,N\}$ , on $\phi$ different nodes other than node $i$ . Let

[TABLE]

and let $g_{i}(s)$ denote the number of sets $S_{id_{ik}}$ with $s\in S_{id_{ik}}$ for all $k\in\{1,2,\dots,\phi\}$ . Then, the required redundancy for tolerating up to $\phi$ simultaneous node failures can be achieved by sending

[TABLE]

to node $d_{ik}$ for all $i\in\{1,2,\dots,N\}$ and $k\in\{1,2,\dots,\phi\}$ . Note that the sets $R_{ik}^{c}$ are of minimal size such that the required number $\phi$ of redundant copies of each search direction vector element is ensured. It holds that $|R_{i1}^{c}|\geq|R_{i2}^{c}|\geq\dots\geq|R_{i\phi}^{c}|$ .

The strategy proposed in Eqn. (5) for selecting the nodes to keep the redundant copies of \IfStrEqj-1 $\bm{p}_{I_{i}}$$\bm{p}_{I_{i}}^{(j-1)}$ and \IfStrEqj $\bm{p}_{I_{i}}$$\bm{p}_{I_{i}}^{(j)}$ is a reasonably good heuristic for minimizing communication overheads during SpMV if we assume that the entries of the system matrix $\bm{A}$ are mostly clustered around the diagonal (since it then is likely that there are some elements which have to be sent anyway from node $i$ to node $d_{ik}$ and, thus, there is no extra latency for establishing a new connection; see Sec. 5 for a more detailed discussion). For matrices with very different sparsity patterns, strategies different from Eqn. (5) may be preferable. A comprehensive analysis of the interaction between a given sparsity pattern and the optimal choice of the “backup nodes” is work in progress.

When the backup strategy based on Eqns. (5) and (6) is employed, we have $\phi+1$ copies of each element of \IfStrEqj-1 $\bm{p}_{I_{i}}$$\bm{p}_{I_{i}}^{(j-1)}$ and \IfStrEqj $\bm{p}_{I_{i}}$$\bm{p}_{I_{i}}^{(j)}$ on $\phi+1$ different nodes (including node $i$ that owns the block) and the two most recent search directions \IfStrEqj-1 $\bm{p}$$\bm{p}^{(j-1)}$ and \IfStrEqj $\bm{p}$$\bm{p}^{(j)}$ can be fully recovered after a simultaneous or overlapping failure of up to $\phi$ arbitrary nodes. If the node failures do not happen simultaneously but are overlapping in time, i.e., more node failures occur during the reconstruction phase, the reconstruction process must be restarted after each node failure (an efficient implementation can of course skip steps that have already been performed and are not affected by the subsequent node failures).

We consider $\psi\leq\phi$ node failures. Let $f_{1},f_{2},\dots,f_{\psi}$ denote the nodes that fail. Then, we can define $I_{f}\coloneqq I_{f_{1}}\cup I_{f_{2}}\cup\dots\cup I_{f_{\psi}}$ and use a similar reconstruction procedure as in the case of a single node failure. Some of the reconstruction steps of Alg. 2 can be performed locally on each of the replacement nodes $f_{1},f_{2},\dots,f_{\psi}$ . However, for computing the matrix-vector products and solving the linear systems in lines 5 to 8 of Alg. 2, additional communication between the $\psi$ replacement nodes is necessary. In Sec. 4.2, we show analytically that the capability of tolerating up to $\phi$ node failures may incur increased communication cost. Hence, $\phi$ should be chosen only as large as necessary for handling the expected number of simultaneous or overlapping node failures on a given parallel computer.

4.2. Analysis

Communication time usually is the main cost for a parallel algorithm, dominating the computation time (Ballard et al., 2014). For analyzing the communication overhead of sending the additional elements of the sets $R_{ik}^{c}$ as defined in Eqn. (6), we adopt a latency-bandwidth communication model (Chan et al., 2007) and assume that the communication cost solely depends on latencies $\lambda_{ik}>0$ —which may vary for different sending nodes $i$ and corresponding receiving nodes $d_{ik}$ according to Eqn. (5)—and cost $\mu>0$ per vector element. We further assume that each node is able to send and receive exactly one element at a time. If $S_{id_{ik}}\neq\emptyset$ , the additional elements of $R_{ik}^{c}$ are sent together with the elements of $S_{id_{ik}}$ , which have to be sent anyway during computing the sparse matrix-vector product $\bm{A}\IfStrEq{j}{}{\bm{p}}{\bm{p}^{(j)}}$ .

If this is the case for all pairs of nodes for a fixed $k\in\{1,2,\dots,\phi\}$ , which we refer to as communication round $k$ , no extra latency cost applies in that round, and the overhead is $\max_{i}|R_{ik}^{c}|\mu$ . In contrast, if $\forall i\in\{1,2,\dots,N\}\colon S_{id_{ik}}=\emptyset$ in communication round $k\in\{1,2,\dots,\phi\}$ , extra latencies $\lambda_{ik}$ incur for all pairs of nodes, and the overhead is $\max_{i}(\lambda_{ik}+|R_{ik}^{c}|\mu)\leq\max_{i}\lambda_{ik}+\max_{i}|R_{ik}^{c}|\mu$ . Hence, in communication round $k\in\{1,2,\dots,\phi\}$ , it holds for the communication overhead $\mathcal{O}$ that

[TABLE]

where $i\in\{1,2,\dots,N\}$ and $\max_{i}|R_{ik}^{c}|\mu=0$ if and only if $\forall i\colon|R_{ik}^{c}|=0$ , i.e., there are at least $\phi-k+1$ redundant copies of all elements of \IfStrEqj $\bm{p}$$\bm{p}^{(j)}$ due to the sparsity pattern of $\bm{A}$ . For all $\phi$ communication rounds, it follows that

[TABLE]

where $\lambda_{\max}\coloneqq\max_{i,k}\lambda_{ik}$ . Hence, the communication overhead $\mathcal{O}$ for keeping $\phi$ redundant copies of all elements of \IfStrEqj $\bm{p}$$\bm{p}^{(j)}$ lies between the lower bound 0 and the upper bound $\phi\left(\lambda_{\max}+\left\lceil\frac{n}{N}\right\rceil\mu\right)$ . The actual communication overhead within that interval entirely depends on the sparsity pattern of $\bm{A}$ , which determines the elements in $R_{ik}^{c}$ .

5. Influence of the sparsity pattern

In this section, we take a closer look at the implications of our theoretical analysis in Sec. 4. As we showed in Sec. 4.2, there exist matrices with sparsity patterns which lead to zero communication overhead during the failure-free execution of the PCG solver, i.e., no extra communication is necessary for distributing $\phi$ redundant copies of all elements of the search direction vector \IfStrEqj $\bm{p}$$\bm{p}^{(j)}$ during the computation of the sparse matrix-vector product $\bm{A}\IfStrEq{j}{}{\bm{p}}{\bm{p}^{(j)}}$ . On the other hand, other matrix sparsity patterns may lead to significant communication overhead during the SpMV operation.

Independent of the particular strategy for selecting the backup nodes, it is clearly beneficial for having a low communication overhead if $m_{i}(s)\geq\phi$ (cf. Eqn. (3)) holds for most or even all $s\in S_{i}$ and $i\in\{1,2,\dots,N\}$ or, in other words, if most or all elements of \IfStrEqj $\bm{p}$$\bm{p}^{(j)}$ anyway—i.e., due to the sparsity pattern of $\bm{A}$ —have to be communicated to at least $\phi$ different nodes during the computation of the product $\bm{A}\IfStrEq{j}{}{\bm{p}}{\bm{p}^{(j)}}$ .

If this is not the case, a possibly considerable number of extra vector elements has to be transferred during the SpMV operation. Since the number of extra elements to be sent is determined by the sparsity pattern of $\bm{A}$ , communication cost can then only be reduced by avoiding extra latencies, i.e., by sending the extra elements to nodes where also other elements have to be sent to. In general, our strategy defined by Eqns. (5) and (6) performs well if $\bm{A}$ is not too sparse within a bandwidth of $\lceil\phi n/(2N)\rceil$ around the diagonal (but can still be very sparse overall). More formally, if for all $i\in\{1,2,\dots,N\}$ and $k\in\{1,2,\dots,\phi\}$ at least one element in each submatrix $\bm{A}_{I_{d_{ik}},I_{i}}$ is not equal to zero, no cost because of extra latencies is incurred. The corresponding proofs are straightforward, but we are omitting them due to space restrictions.

6. Implementation

Up to this point, we have analyzed the algorithms in theoretical terms. We now describe how the reconstruction is realized in a real-life, finite-precision machine.

We implement Chen’s algorithm (Chen, 2011) and our novel extensions as described in Sec. 3 and Sec. 4 using the PETSc framework (Balay et al., 1997, 2018). PETSc provides the CG solver and linear algebra operations, and it also manages communication between nodes. We use operations offered by the framework to transfer the information required for the recovery from node failures. We use a block Jacobi as a preconditioner during the regular operation of the solver, solving the preconditioner blocks exactly.

To impart fault tolerance, the SpMV is modified to transfer the additional data required to obtain the desired level of data redundancy. In PETSc, the SpMV operation is realized with a generalized scatter: A node determines, from the non-zero entries in its matrix rows, what vector components it requires from its neighbors to perform the SpMV product. With this information, PETSc collectively creates a communication context for the generalized scatter operation, defining what entries of a distributed vector are communicated, and where they must be communicated to.

In our experiments, we simulate node failures. Instead of taking down nodes and producing replacements during the reconstruction phase, a node will perform the operations and communication required to restore the solver state. The reconstruction process requires sending the surviving entries of the search direction vector to replacement nodes. In our experiments, this is achieved by reversing the communication that takes place during the matrix-vector product. PETSc already provides this functionality. However, reversing the communication that occurs during the matrix-vector product is not a well-defined operation. To see this, imagine a communication context that dictates that the entry in position $i_{0}$ of a vector, located in some node $A$ , must be transferred to positions $i_{1}$ , in node $B$ , and $i_{2}$ , in node $C$ . In the reverse communication process, entries in positions $i_{1}$ and $i_{2}$ will hold candidates for the value of that entry to be transferred to position $i_{0}$ . In the absence of node-failures, both candidates will be the same value, because they are copies of the original entry in $i_{0}$ , and the operation is then well defined, but, in the event of a node failure, it is possible that one of these entries is lost and the resulting candidates are different, conflicting values. Therefore, in such cases, the reverse communication process used in the reconstruction could be non-deterministic. We cope with this issue by keeping the search directions in the nodes simulating a node failure. If this information is stored, communication with the reversed context is deterministic, because there would never be a conflict between the candidates.

This problem arises because we use the reverse of a communication context intended for SpMV. In a more optimized implementation, a tailored communication context can be produced after the node failure takes place, which avoids this problem altogether by selecting the entries that we need for the reconstruction.

In our implementation, the linear system arising in line 8 of Alg. 2, is solved using a PCG solver assembled with global operations. In particular, the matrix-vector products of the submatrix $\bm{A}_{I_{f},I_{f}}$ and the subvector \IfStrEqj $\bm{x}_{I_{f}}$$\bm{x}_{I_{f}}^{(j)}$ are performed by multiplying the entire matrix $\bm{A}$ with a modified vector \IfStrEqj $\bm{x}$$\bm{x}^{(j)}$ , whose appropriate entries were set to zero. The desired $\bm{A}_{I_{f},I_{f}}\IfStrEq{j}{}{\bm{x}_{I_{f}}}{\bm{x}_{I_{f}}^{(j)}}$ is a subvector of the result of this global operation. This is less efficient than working with the actual submatrix of $\bm{A}$ , but some of the changes we introduced to increase redundancy conflict with PETSc’ ability to create submatrices. The cost to reconstruct the solution, however, remains very small compared to the overall runtime. The CG solver for the subsystem uses a block Jacobi preconditioner, with blocks matching the process’ index set. We use an approximate solver based on ILU factorization for the blocks.

Avoiding loss of orthogonality

For this section it is useful to distinguish between the solver residual, that is, vector \IfStrEqj $\bm{r}$$\bm{r}^{(j)}$ from Alg. 1, and the vector $\IfStrEq{}{}{\bm{b}}{\bm{b}^{()}}-A\IfStrEq{j}{}{\bm{x}}{\bm{x}^{(j)}}$ . These vectors are, in general, not equal in a finite-precision machine.

The CG algorithm in floating-point arithmetic undergoes loss of orthogonality, where roundoff error accumulates and the conjugacy of the search directions is lost as the solver progresses. Consequently, in regular PCG the solver residual and the vector $\IfStrEq{}{}{\bm{b}}{\bm{b}^{()}}-\bm{A}\bm{x}$ will differ slightly after convergence. Because we work with finite precision, and because we solve the local linear system in the reconstruction process (line 8 of Alg. 2) iteratively, our algorithm only reconstructs an approximation of the solver state before the node failures take place, thus potentially further contributing to this loss of orthogonality: Consequently, the ESR solver residual after convergence can be larger than the solver residual of PCG. The largest source of deviations is the solution of the local linear system of line 8 of Alg. 2. The loss of orthogonality relative to regular PCG can be controlled with the tolerance of this linear system.

To compare the accuracies of ESR and PCG in this regard, we define the relative residual difference metric:

[TABLE]

Here, $\bm{r}_{\mbox{\tiny ESR}}$ and $\bm{r}_{\mbox{\tiny PCG}}$ are the solver residual vectors of ESR with reconstruction and reference PCG respectively after convergence, and $\bm{x}_{\mbox{\tiny ESR}}$ and $\bm{x}_{\mbox{\tiny PCG}}$ are the corresponding iterands.

A side-by-side comparison of the ESR method and regular PCG can show that the former is as accurate as the latter. In Sec. 7.2 we show that the effects of using finite-precision arithmetic can be made negligible. Since node failures are uncommon and reconstruction is relatively cheap to perform, we can set the tolerance for the local system to a very small value, so that ESR converges, while the reconstruction overhead remains low.

7. Numerical experiments

Now we summarize our experimental setup and discuss our results.

7.1. Experimental setup

We use SPD matrices from the SuiteSparse Matrix Collection (Davis and Hu, 2011) as test problems, selecting problems from different application areas. Their properties are summarized in Table 1. We select medium (M1, M2) and large (M3-M8) size problems. The latter are among the largest available SPD matrices in the SuiteSparse Matrix Collection. There are problems with different numbers of non-zeros. Large matrices with relatively few non-zeros are common, but problems with more non-zero entries are more expensive to compute and would benefit the most from resilience schemes to protect the resource investment.

Our experiments are run on 128 nodes of the VSC3 system of the Vienna Scientific Cluster. Although our algorithm is well suited for multiple processes per node, we use only one process per node in our experiments. The argument for this decision is twofold: Firstly, from the point of view of resilience, the number of processes per node (cf. Sec. 1.1) makes no difference since the redundant vector elements always have to be stored on a different node (not just in the memory of another process). Secondly, the runtimes obtained in our experiments are based on simulations of node failures without ULFM (cf. Secs. 1.1 and 6). Hence, the relative runtime differences are more significant than the absolute runtimes (and, thus, improvements of the absolute runtimes due to multiple processes per node). Experiments for a matrix are run on the same set of nodes of VSC3. The system’s topology is a fat tree. We use the following libraries:

• Intel MPI 5.1.3

• PETSc 3.10.4

• Intel MKL 2018.3

. We use the Intel C compiler 18.0.5 with compiler flags -O3, -march=native and -mtune=native. We terminate the solver once the relative residual norm has been reduced by a factor of $10^{8}$ . The local linear system used during the reconstruction is terminated once its residual norm is reduced by a factor of $10^{14}$ .

We measure the runtime of the solver for several problem settings. Node failures are introduced once for each simulation, with either one, three or eight simultaneous node failures taking place at either 20%, 50% or 80% of the solver progress (measured in number of iterations). These failures are placed in contiguous ranks. Simultaneous node failures can well be caused by a faulty switch, therefore it seems like a realistic assumption that they are clustered: The node failures are introduced in neighbouring ranks either at the beginning or in the center of the vector, starting from rank 0 or 64 respectively. Additionally, we have results for runs without failures with either one, three or eight redundant copies. Each measurement in this test constellation is repeated at least 5 times.

7.2. Experimental results

Our experimental results are summarized in Tab. 2. Statistics for node failures are computed over at least 15 values: at least 5 measurements for node failures introduced at either 20%, 50% and 80% of the solver’s progress.

As expected, a larger number of redundant copies leads to larger overheads. In general, the reconstruction time remains small, with larger relative reconstruction costs for matrices with smaller runtimes, where the absolute cost is consequently smaller.

The overall relative overhead with node failures, shown in the last three columns of Tab. 2, corresponds to the sum of the relative overhead of the undisturbed case plus the relative runtime cost of the reconstruction. These columns roughly match the sum, as expected, with some variation arising from the variation in runtime of running the code in a real machine, and from differences in the number of iterations caused by the reconstruction (cf. (Pachajoa et al., 2018)).

Tab. 2 also shows that the location of the node failure does affect the reconstruction cost, as the runtime is, in general, different for node failures at the start (lower indexes) or the center (middle indexes) of the vectors. These differences come from different diagonal linear systems in the reconstruction: Submatrices formed from the index sets of different failed nodes have different properties, and they will not converge at the same rate with an ILU preconditioner, thus affecting the reconstruction time.

Our experiments show that our algorithm is very efficient for matrices with many non-zeros contained in a band close to the diagonal. Relatively denser matrices also take longer to reach convergence because of the longer time required for the matrix-vector product, so it makes a lot of sense to protect the time investment in the solution process with a resilience technique like the one presented in this paper.

Fig. 1 is an example of the runtimes and overheads obtained with our novel resilient algorithm for the matrix M5. For this matrix, the state reconstruction operation takes very little time: The runtimes for cases with failures, (orange boxes) are very close to the failure-free cases (blue boxes). In this case, the overhead for the method comes predominantly from the additional communication required to maintain redundant data.

In Fig. 2 the boxes corresponding to three redundant copies indicate a smaller runtime for simulations with node failures than for the failure-free case. As mentioned before, this can happen, since the number of iterations required after reconstruction can be a smaller than in the failure-free run (cf. (Pachajoa et al., 2018)).

Fig. 3 shows a test case where the overhead required to keep redundant data increases superlinearly with the number of node failures tolerated, As explained in Sec. 5, the growth of the overhead with the number of node failures tolerated strongly depends on the sparsity pattern of $\bm{A}$ . The matrix M8 contains many non-zeros in a band around the diagonal in the middle indexes. As expected from the analysis of Sec. 5, this is a particularly favorable case for our method: It can resist three simultaneous node failures with an overhead of around 2.5%, and eight node failures with an overhead of around 10%.

In general, the iteration at which the node failures are introduced has little influence on the runtime of the solver. Fig. 4 illustrates this for M5. We observed the same behaviour for the other test cases.

In Sec. 6, we discuss the potential loss of orthogonality that occurs when performing the reconstruction. Tab. 3 shows that the relative residual difference for our method is comparable to the one for the reference results, even for its maximum value among all experiments for a given matrix. The deviations for both methods are tiny in comparison to the reduction of the residual norm by a factor of $10^{8}$ from the solver.

8. Conclusions

In this paper, we first reviewed the ESR approach, which was initially proposed by Chen (Chen, 2011) and later refined by Pachajoa et al. (Pachajoa et al., 2018), an efficient fault-tolerance technique to protect the PCG method against a single node failure (cf. Secs. 2.2 and 3). We then proposed an enhancement to the ESR approach that allows the PCG method to tolerate simultaneous or overlapping failures of multiple nodes (cf. Sec. 4.1). Our new strategy determines where to efficiently store redundant information in order to support up to $\phi<N$ simultaneous node failures. In a theoretical analysis, we found that the communication overhead due to the distribution of the required additional redundant vector copies strongly depends on the sparsity pattern of the given system matrix (cf. Secs. 4.2 and 5).

In order to investigate the effects of floating-point arithmetic and the runtime performance of our novel algorithm, we implemented it based on the widely used library PETSc (cf. Sec. 6) and conducted numerical experiments on 128 nodes of the Vienna Scientific Cluster (cf. Sec. 7). The results of our experiments with eight large sparse matrices from real-world applications show that the proposed enhanced ESR approach is very efficient. Compared to a non-resilient PCG run, we measured runtime overheads between 2.2% and 24.1% for an undisturbed run with up to three tolerated simultaneous node failures and between 2.8% and 55.0% for a run with three actual simultaneous node failures including reconstruction.

In future work, we are going to further enhance the ESR approach such that it automatically adapts to different sparsity patterns of matrices. Moreover, we want to investigate communication-avoiding PCG methods. Another interesting future direction is to find a variant of the ESR approach that does not depend on the availability of replacement nodes for failed nodes.

Acknowledgements.

This work has been funded by the Sponsor Vienna Science and Technology Fund (WWTF) https://www.wwtf.at/ through project Grant #ICT15-113. The computational results presented have been achieved using the Vienna Scientific Cluster (VSC).

Bibliography31

The reference list from the paper itself. Each links out to its DOI / PubMed record.

1(1)
2Agullo et al . (2013) E. Agullo, L. Giraud, A. Guermouche, J. Roman, and M. Zounon. 2013. Towards resilient parallel linear Krylov solvers: recover-restart strategies . Research Report RR-8324. INRIA.
3Agullo et al . (2016) E. Agullo, L. Giraud, A. Guermouche, J. Roman, and M. Zounon. 2016. Numerical recovery strategies for parallel resilient Krylov linear solvers. Numer. Lin. Algebra. Appl. 23, 5 (2016), 888–905.
4Balay et al . (2018) S. Balay, S. Abhyankar, M. F. Adams, J. Brown, P. Brune, K. Buschelman, L. Dalcin, V. Eijkhout, W. D. Gropp, D. Kaushik, M. G. Knepley, D. A. May, L. Curfman Mc Innes, R. T. Mills, T. Munson, K. Rupp, P. Sanan, B. F. Smith, S. Zampini, H. Zhang, and H. Zhang. 2018. PET Sc Users Manual . Technical Report ANL-95/11 - Revision 3.9. Argonne National Laboratory.
5Balay et al . (1997) S. Balay, W. D. Gropp, L. C. Mc Innes, and B. F. Smith. 1997. Efficient Management of Parallelism in Object-Oriented Numerical Software Libraries . Birkhäuser Boston, 163–202.
6Ballard et al . (2014) G. Ballard, E. Carson, J. Demmel, M. Hoemmen, N. Knight, and O. Schwartz. 2014. Communication lower bounds and optimal algorithms for numerical linear algebra. Acta Numer. 23 (2014), 1–155.
7Bland et al . (2013) W. Bland, A. Bouteiller, T. Herault, G. Bosilca, and J. Dongarra. 2013. Post-failure recovery of MPI communication capability: Design and rationale. Int. J. High Perform. Comput. Appl. 27, 3 (2013), 244–254.
8Bosilca et al . (2014) G. Bosilca, A. Bouteiller, T. Herault, Y. Robert, and J. Dongarra. 2014. Assessing the Impact of ABFT and Checkpoint Composite Strategies. In 2014 IEEE International Parallel Distributed Processing Symposium Workshops . 679–688.