Numerically Stable Recurrence Relations for the Communication Hiding Pipelined Conjugate Gradient Method

Siegfried Cools; Jeffrey Cornelis; Wim Vanroose

arXiv:1902.03100·cs.NA·May 16, 2019

TL;DR

This paper introduces a numerically stable variant of the pipelined Conjugate Gradient method, enhancing accuracy and parallel performance for large-scale linear system solutions without increasing computational cost.

Contribution

It presents modified recurrence relations, replacing the multi-term recurrence for the basis vectors with $l$ three-term recurrences, that improve the numerical stability of the pipelined Conjugate Gradient method and enable high accuracy regardless of pipeline length.

Findings

1. Achieves high accuracy independently of pipeline length
2. Demonstrates excellent parallel performance
3. Resolves stability issues in pipelined Krylov methods

Abstract

Pipelined Krylov subspace methods (also referred to as communication-hiding methods) have been proposed in the literature as a scalable alternative to classic Krylov subspace algorithms for iteratively computing the solution to a large linear system in parallel. For symmetric and positive definite system matrices the pipelined Conjugate Gradient method outperforms its classic Conjugate Gradient counterpart on large scale distributed memory hardware by overlapping global communication with essential computations like the matrix-vector product, thus hiding global communication. A well-known drawback of the pipelining technique is the (possibly significant) loss of numerical stability. In this work a numerically stable variant of the pipelined Conjugate Gradient algorithm is presented that avoids the propagation of local rounding errors in the finite precision recurrence relations that…

Tables

Table 1. Theoretical specifications of different CG variants (no preconditioning); ‘CG’ denotes classic CG; ‘p-CG’ is Ghysels’ pipelined CG [19]. Columns glsync and spmv list the number of synchronization phases and spmvs per iteration. The column Flops indicates the number of flops ( $\times N$ ) required to compute axpys and dot products (with $l\geq 1$ ). The Time column shows the time spent in glreds (global all-reduce communications) and spmvs. Memory counts the total number of vectors in memory (excl. $x_{i-l}$ and $b$ ) at any time during execution.

| | glsync | spmv | Flops (axpy & dot) | Time (glred & spmv) | Memory |
|---|---|---|---|---|---|
| CG | 2 | 1 | 10 | 2 glred + 1 spmv | 3 |
| p-CG | 1 | 1 | 16 | $\max$ (glred, spmv) | 6 |
| Alg. 1 | 1 | 1 | $6l+10$ | $\max$ (glred $/l$ , spmv) | $\max$ ( $3l+3$ , $7$ ) |
| Alg. 2 | 1 | 1 | $6l+10$ | $\max$ (glred $/l$ , spmv) | $\max$ ( $4l+1$ , $7$ ) |

Equations

$$A V_{j} = V_{j+1} T_{j+1,j}, \qquad 1 \leq j \leq i-l.$$

$$T_{j+1,j} = \begin{pmatrix} \gamma_{0} & \delta_{0} & & \\ \delta_{0} & \gamma_{1} & \ddots & \\ & \ddots & \ddots & \delta_{j-2} \\ & & \delta_{j-2} & \gamma_{j-1} \\ & & & \delta_{j-1} \end{pmatrix}.$$

$$v_{j+1} = (A v_{j} - \gamma_{j} v_{j} - \delta_{j-1} v_{j-1}) / \delta_{j}, \qquad 0 \leq j < i-l.$$

$$z_{j} = \begin{cases} v_{0}, & j = 0, \\ P_{j}(A) v_{0}, & 0 < j \leq l, \\ P_{l}(A) v_{j-l}, & l < j \leq i, \end{cases}$$

$$P_{k}(t) = \prod_{j=0}^{k-1} (t - \sigma_{j}), \qquad k \leq l,$$

$$z_{j+1} = \begin{cases} (A - \sigma_{j} I) z_{j}, & 0 \leq j < l, \\ (A z_{j} - \gamma_{j-l} z_{j} - \delta_{j-l-1} z_{j-1}) / \delta_{j-l}, & l \leq j < i, \end{cases}$$

$$A Z_{j} = Z_{j+1} B_{j+1,j}, \qquad 1 \leq j \leq i,$$

$$g_{j,i} = (v_{j}, z_{i}) = (v_{j}, P_{l}(A) v_{i-l}) = (P_{l}(A) v_{j}, v_{i-l}) = (v_{i-l}, z_{j+l}) = g_{i-l,j+l}.$$

$$v_{j+1} = \Big( z_{j+1} - \sum_{k=j-2l+1}^{j} g_{k,j+1} v_{k} \Big) / g_{j+1,j+1}.$$

$$g_{j,i+1} = \begin{cases} (z_{i+1}, v_{j}), & j = \max(0, i-2l+1), \ldots, i-l+1, \\ (z_{i+1}, z_{j}), & j = i-l+2, \ldots, i+1, \end{cases}$$

$$g_{j,i-l+1} = \frac{g_{j,i-l+1} - \sum_{k=i-3l+1}^{j-1} g_{k,j} g_{k,i-l+1}}{g_{j,j}},$$

$$g_{i-l+1,i-l+1} = \sqrt{g_{i-l+1,i-l+1} - \sum_{k=i-3l+1}^{i-l} g_{k,i-l+1}^{2}}.$$

$$\gamma_{i-l} = \begin{cases} \dfrac{g_{i-l,i-l+1} + \sigma_{i-l} g_{i-l,i-l} - g_{i-l-1,i-l} \delta_{i-l-1}}{g_{i-l,i-l}}, & l \leq i < 2l, \\[2ex] \dfrac{g_{i-l,i-l} \gamma_{i-2l} + g_{i-l,i-l+1} \delta_{i-2l} - g_{i-l-1,i-l} \delta_{i-l-1}}{g_{i-l,i-l}}, & i \geq 2l. \end{cases}$$

$$\delta_{i-l} = \begin{cases} g_{i-l+1,i-l+1} / g_{i-l,i-l}, & l \leq i < 2l, \\ (g_{i-l+1,i-l+1} \delta_{i-2l}) / g_{i-l,i-l}, & i \geq 2l. \end{cases}$$

$$0 = V_{m}^{T} r_{m} = V_{m}^{T} (r_{0} - A V_{m} y_{m}) = V_{m}^{T} (V_{m} \|r_{0}\|_{2} e_{1}) - T_{m} y_{m} = \|r_{0}\|_{2} e_{1} - T_{m} y_{m},$$

$$T_{i-l+1} = L_{i-l+1} U_{i-l+1} = \begin{pmatrix} 1 & & & \\ \lambda_{1} & 1 & & \\ & \ddots & \ddots & \\ & & \lambda_{i-l} & 1 \end{pmatrix} \begin{pmatrix} \eta_{0} & \delta_{0} & & \\ & \eta_{1} & \ddots & \\ & & \ddots & \delta_{i-l-1} \\ & & & \eta_{i-l} \end{pmatrix}.$$

$$\lambda_{j} = \delta_{j-1} / \eta_{j-1} \quad \text{and} \quad \eta_{j} = \gamma_{j} - \lambda_{j} \delta_{j-1}.$$

$$x_{i-l+1} = x_{0} + V_{i-l+1} U_{i-l+1}^{-1} L_{i-l+1}^{-1} \|r_{0}\|_{2} e_{1} = x_{0} + P_{i-l+1} q_{i-l+1},$$

$$p_{j} = \eta_{j}^{-1} (v_{j} - \delta_{j-1} p_{j-1}), \qquad 1 \leq j \leq i-l.$$

$$x_{i-l} = x_{i-l-1} + \zeta_{i-l-1} p_{i-l-1}.$$

$$Z_{i}^{T} Z_{i} = G_{i}^{T} V_{i}^{T} V_{i} G_{i} = G_{i}^{T} G_{i}.$$

$$G_{i}^{-1} = (Z_{i}^{T} Z_{i})^{-1} G_{i}^{T}.$$

$$Z_{l+1:i}^{T} Z_{l+1:i} = (P_{l}(A) V_{i-l})^{T} P_{l}(A) V_{i-l} = V_{i-l}^{T} P_{l}(A)^{2} V_{i-l},$$

$$\sigma_{i} = \frac{\lambda_{\max} + \lambda_{\min}}{2} + \frac{\lambda_{\max} - \lambda_{\min}}{2} \cos\left(\frac{(2i+1)\pi}{2l}\right),$$

$$\bar{V}_{i}^{T} (b - A \bar{x}_{i}) = \bar{V}_{i}^{T} \bar{v}_{0} \|\bar{r}_{0}\| - \bar{V}_{i}^{T} A \bar{V}_{i} \bar{T}_{i}^{-1} \|\bar{r}_{0}\| e_{1} = 0.$$

$$\| \alpha v - \text{fl}(\alpha v) \|, \qquad \| v + w - \text{fl}(v + w) \|, \qquad \| A v - \text{fl}(A v) \|$$

$$\bar{x}_{i}$$

Taxonomy

Topics: Matrix Theory and Algorithms · Electromagnetic Scattering and Analysis · Advanced Numerical Methods in Computational Mathematics

Full text

Numerically Stable Recurrence Relations for the Communication Hiding Pipelined Conjugate Gradient Method

Siegfried Cools∗, Jeffrey Cornelis∗, Wim Vanroose∗

Applied Mathematics Group, Department of Mathematics and Computer Science, University of Antwerp. Address: University of Antwerp, Campus Middelheim, Building G, Middelheimlaan 1, 2020 Antwerp, Belgium. E-mail: [email protected] (corresponding author). Funding: S. Cools is funded by Research Foundation Flanders (FWO) under grant 12H4617N. J. Cornelis receives funding from the University of Antwerp Research Council under the University Research Fund (BOF).

Abstract

Pipelined Krylov subspace methods (also referred to as communication-hiding methods) have been proposed in the literature as a scalable alternative to classic Krylov subspace algorithms for iteratively computing the solution to a large linear system in parallel. For symmetric and positive definite system matrices the pipelined Conjugate Gradient method, p( $l$ )-CG, outperforms its classic Conjugate Gradient counterpart on large scale distributed memory hardware by overlapping global communication with essential computations like the matrix-vector product, thus “hiding” global communication. A well-known drawback of the pipelining technique is the (possibly significant) loss of numerical stability. In this work a numerically stable variant of the pipelined Conjugate Gradient algorithm is presented that avoids the propagation of local rounding errors in the finite precision recurrence relations that construct the Krylov subspace basis. The multi-term recurrence relation for the basis vector is replaced by $l$ three-term recurrences, improving stability without increasing the overall computational cost of the algorithm. The proposed modification ensures that the pipelined Conjugate Gradient method is able to attain a highly accurate solution independently of the pipeline length. Numerical experiments demonstrate a combination of excellent parallel performance and improved maximal attainable accuracy for the new pipelined Conjugate Gradient algorithm. This work thus resolves one of the major practical restrictions for the usability of pipelined Krylov subspace methods.

Index Terms:

Krylov subspace methods, Pipelining, Parallel performance, Global communication, Latency hiding, Conjugate Gradients, Numerical stability, Inexact computations, Attainable accuracy.

AMS subject classifications: 65F10, 65N12, 65G50, 65Y05, 65N22.

1 Introduction

The family of iterative solvers known as Krylov subspace methods (KSMs) [29, 35, 43] is among the most efficient present-day methods for solving large scale sparse systems of linear equations. The mother of all Krylov subspace methods is undoubtedly the Conjugate Gradient method (CG), which was derived in 1952 [25] with the aim of solving linear systems $Ax=b$ with a symmetric positive definite and preferably sparse matrix $A$ . The CG method is one of the most widely used methods for solving such systems today, and these systems form the basis of a plethora of scientific and industrial applications. However, driven by the essential transition from optimal single node performance towards massively parallel computer hardware over the last decades [37], the bottleneck for fast execution of Krylov subspace methods has shifted. Whereas in the past the application of the sparse matrix-vector product (spmv) was considered the most time-consuming part of the algorithm, the global synchronizations required in dot product and norm computations form the main bottleneck for efficient execution on present-day distributed memory hardware [15].

Driven by the increasing levels of parallelism in present-day HPC machines, as attested by the current drive towards exascale high-performance computing software [14], research on the elimination of the global communication bottleneck has recently regained significant attention from the international computer science, engineering and numerical mathematics communities. Sprouting largely from pioneering work on reducing communication in Krylov subspace methods from the late 1980s and 1990s [40, 31, 6, 12, 13, 17], a number of variants of the classic Krylov subspace algorithms have been introduced over the last years. We point out recent work by Chronopoulos et al. [7], Hoemmen [27], Carson et al. [4], McInnes et al. [30], Grigori et al. [23], Eller et al. [16], Imberti et al. [28] and Zhuang et al. [47].

The contents of the current work are situated in the research branch on so-called “pipelined” Krylov subspace methods [18, 19, 9]. (Note: in the context of communication reduction in Krylov subspace methods, the term “pipelined” KSMs, as used throughout the related applied linear algebra and computer science literature, refers to software pipelining, i.e. algorithmic reformulations of the KSM procedure that reduce communication overhead, and should not be confused with hardware-level instruction pipelining (ILP).) Alternatively called “communication hiding” methods, these algorithmic variations to classic Krylov subspace methods are designed to overlap the time-consuming global communications in each iteration of the algorithm with computational tasks such as calculating spmvs or axpys (vector operations of the form $y\leftarrow\alpha x+y$ ). Thanks to the reduction/elimination of the synchronization bottleneck, pipelined algorithms have been shown to increase parallel efficiency by allowing the algorithm to continue scaling on large numbers of processors [36, 46]. However, the algorithmic reformulations that allow for this efficiency increase come at the cost of reduced numerical stability [19, 5], which presently is one of the main drawbacks of pipelined (as well as other communication reducing) methods. Research on analyzing and improving the numerical stability of pipelined Krylov subspace methods, which is essential both for a proper understanding and the practical usability of the methods, has recently been performed by the authors [10, 8] and others [3, 5].

This work presents a numerically stable variant of the $l$ -length pipelined Conjugate Gradient method, p( $l$ )-CG for short. The p( $l$ )-CG method was presented in [11] and allows each global reduction phase to be overlapped with the computational work of $l$ subsequent iterations. The pipeline length $l$ is a parameter of the method that can be chosen depending on the problem and hardware setup (as a function of the communication-to-computation ratio). As is the case for all communication reducing Krylov subspace methods, the preconditioner choice influences the communication-to-computation ratio and thus affects the performance of the method. The pipeline length hence also depends on the effort invested in the preconditioner. A preconditioner that uses limited global communication (block Jacobi, no-overlap DDM, …) is generally preferred in this setting.

The propagation of local rounding errors in the multi-term recurrence relations of the p( $l$ )-CG algorithm is the primary source of loss of attainable accuracy on the final solution [11]. By introducing intermediate auxiliary basis variables, we derive a new p( $l$ )-CG algorithm with modified recurrence relations for which no rounding error propagation occurs. It is proven analytically that the resulting recurrence relations are numerically stable for any pipeline length $l$ . The new algorithm is guaranteed to reach the same accuracy as the classic CG method. This work thus resolves one of the major restrictions for the practical use of pipelined Krylov subspace methods. The redesigned algorithm comes at no additional computational cost and has only a minor storage overhead compared to the former p( $l$ )-CG algorithm, thus effectively replacing the earlier implementation of the method. In addition, it is shown that formulating a preconditioned version of the new algorithm is straightforward.

We conclude this introduction by providing a short overview of the further contents of this manuscript. Section 2 presents a high-level summary of the basic principles behind the $l$ -length pipelined CG method and formulates the key numerical properties of the method that motivate this work. It familiarizes the reader with the notations and concepts used throughout this paper. In Section 3 the numerical stability analysis of the p( $l$ )-CG recurrence relations is briefly recapped, as it forms the basis for the analysis of the stable algorithm in Section 4.4. Section 4 contains the main contributions of this work, presenting the technical details of the stable p( $l$ )-CG algorithm alongside an overview of its main implementation properties and a numerical analysis of the new rounding error resilient recurrence relations. Numerical experiments that demonstrate both the parallel performance of the p( $l$ )-CG method and the excellent attainable accuracy in comparison to earlier variants of pipelined Krylov subspace methods are presented in Section 5. The manuscript is concluded in Section 6.

For completeness we note that the numerical analysis in Section 4 focuses on analyzing the propagation of local rounding errors throughout the new p( $l$ )-CG algorithm in detail, but does not include a standard forward or backward stability analysis with bounds on the local rounding errors.

2 Deep pipelined Conjugate Gradients

The deep pipelined Conjugate Gradient method, denoted p( $l$ )-CG for short, was first presented in [11], where it was derived in analogy to the p( $l$ )-GMRES method [18]. The parameter $l$ represents the pipeline length which indicates the number of iterations that are overlapped by each global reduction phase. We summarize the current state-of-the-art deep pipelined p( $l$ )-CG method below, which forms the starting point for the discussion in this work.

2.1 Basis recurrence relations in exact arithmetic

Let $V_{i-l+1}=[v_{0},v_{1},\ldots,v_{i-l}]$ be the orthonormal basis for the Krylov subspace $\mathcal{K}_{i-l+1}(A,v_{0})$ in iteration $i$ of the p( $l$ )-CG algorithm, consisting of $i-l+1$ vectors. Here $A$ is a symmetric positive definite matrix. The Krylov subspace basis vectors satisfy the Lanczos relation

$$A V_{j} = V_{j+1} T_{j+1,j}, \qquad 1 \leq j \leq i-l, \tag{1}$$

with

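$$T_{j+1,j} = \begin{pmatrix} \gamma_{0} & \delta_{0} & & \\ \delta_{0} & \gamma_{1} & \ddots & \\ & \ddots & \ddots & \delta_{j-2} \\ & & \delta_{j-2} & \gamma_{j-1} \\ & & & \delta_{j-1} \end{pmatrix}. \tag{2}$$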

Let $\delta_{-1}=0$ , then the Lanczos relation (1) translates in vector notation to

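$$v_{j+1} = (A v_{j} - \gamma_{j} v_{j} - \delta_{j-1} v_{j-1}) / \delta_{j}, \qquad 0 \leq j < i-l. \tag{3}$$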
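As an illustration of the three-term recurrence (3), the following is a minimal sketch in plain Python. The test operator `laplacian_matvec` (a 1D Laplacian stencil) and the function names are assumptions made for this example; they do not appear in the paper, and the code is the classic Lanczos process rather than the pipelined algorithm.

```python
import math

def laplacian_matvec(v):
    """Apply the 1D Laplacian stencil [-1, 2, -1], an SPD test matrix (illustrative choice)."""
    n = len(v)
    return [2.0 * v[k] - (v[k - 1] if k > 0 else 0.0) - (v[k + 1] if k < n - 1 else 0.0)
            for k in range(n)]

def dot(x, y):
    return sum(a * b for a, b in zip(x, y))

def lanczos(matvec, v0, steps):
    """Return basis vectors v_0..v_steps and coefficients gamma_j, delta_j of relation (3)."""
    nrm = math.sqrt(dot(v0, v0))
    V = [[x / nrm for x in v0]]
    gammas, deltas = [], []
    for j in range(steps):
        w = matvec(V[j])
        gamma = dot(w, V[j])                      # gamma_j = (A v_j, v_j)
        w = [wk - gamma * vk for wk, vk in zip(w, V[j])]
        if j > 0:                                 # delta_{-1} = 0, so skip for j = 0
            w = [wk - deltas[j - 1] * vk for wk, vk in zip(w, V[j - 1])]
        delta = math.sqrt(dot(w, w))              # delta_j normalizes v_{j+1}
        gammas.append(gamma)
        deltas.append(delta)
        V.append([wk / delta for wk in w])
    return V, gammas, deltas
```

Each step orthogonalizes only against the previous two basis vectors, which is exactly why orthogonality can degrade in finite precision over many iterations.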

The auxiliary basis $Z_{i+1}=[z_{0},z_{1},\ldots,z_{i}]$ runs $l$ vectors ahead of the basis $V_{i-l+1}$ and is defined as

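$$z_{j} = \begin{cases} v_{0}, & j = 0, \\ P_{j}(A) v_{0}, & 0 < j \leq l, \\ P_{l}(A) v_{j-l}, & l < j \leq i, \end{cases} \tag{4}$$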

where the matrix polynomial $P_{l}(A)$ is given by

$$P_{k}(t) = \prod_{j=0}^{k-1} (t - \sigma_{j}), \qquad k \leq l, \tag{5}$$

with optional stabilizing shifts $\sigma_{j}\in\mathbb{R}$ , see [18, 11, 27]. We refer to Section 2.2 for a discussion on the Krylov subspace basis $Z_{i+1}$ , i.e. the choice of the polynomial $P_{l}(A)$ . Contrary to the basis $V_{i-l+1}$ , the auxiliary basis $Z_{i+1}$ is in general not orthonormal. It is constructed using the recursive definitions

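$$z_{j+1} = \begin{cases} (A - \sigma_{j} I) z_{j}, & 0 \leq j < l, \\ (A z_{j} - \gamma_{j-l} z_{j} - \delta_{j-l-1} z_{j-1}) / \delta_{j-l}, & l \leq j < i, \end{cases} \tag{6}$$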

which are obtained by multiplying the Lanczos relation (3) on both sides by $P_{l}(A)$ . Expression (6) translates into a Lanczos type matrix relation

$$A Z_{j} = Z_{j+1} B_{j+1,j}, \qquad 1 \leq j \leq i, \tag{7}$$

where the matrix $B_{j+1,j}$ contains the matrix $T_{j-l+1,j-l}$ , which is shifted $l$ places along the main diagonal. The bases $V_{j}$ and $Z_{j}$ are connected through the basis transformation $Z_{j}=V_{j}G_{j}$ for $1\leq j\leq i-l+1$ , where $G_{j}$ is a banded upper triangular matrix with a band width of $2l+1$ non-zero diagonals [11]. For a symmetric matrix $A$ the matrix $G_{i+1}$ is symmetric around its $l$ -th upper diagonal, since

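$$g_{j,i} = (v_{j}, z_{i}) = (v_{j}, P_{l}(A) v_{i-l}) = (P_{l}(A) v_{j}, v_{i-l}) = (v_{i-l}, z_{j+l}) = g_{i-l,j+l}. \tag{8}$$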
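The construction of the auxiliary basis and the symmetry of $G$ can be checked numerically on a small problem. The sketch below assumes a diagonal SPD test matrix and arbitrary illustrative shifts (both are choices made for this example, not taken from the paper); it builds $v_{j}$ with the Lanczos recurrence and $z_{j}$ with the two-phase recurrence for the auxiliary basis, so that $z_{j}=P_{l}(A)v_{j-l}$ and $(v_{j},z_{i})=(v_{i-l},z_{j+l})$ can be verified.

```python
import math

def dot(x, y):
    return sum(a * b for a, b in zip(x, y))

def aux_basis(diag, v0, l, shifts, m):
    """Build v_0..v_m (Lanczos) and z_0..z_{l+m} (auxiliary basis) for A = diag(diag)."""
    matvec = lambda v: [a * x for a, x in zip(diag, v)]
    nrm = math.sqrt(dot(v0, v0))
    V, gam, dlt = [[x / nrm for x in v0]], [], []
    for j in range(m):                       # Lanczos three-term recurrence
        w = matvec(V[j])
        gam.append(dot(w, V[j]))
        w = [wk - gam[j] * vk for wk, vk in zip(w, V[j])]
        if j > 0:
            w = [wk - dlt[j - 1] * vk for wk, vk in zip(w, V[j - 1])]
        dlt.append(math.sqrt(dot(w, w)))
        V.append([wk / dlt[j] for wk in w])
    Z = [list(V[0])]                         # z_0 = v_0
    for j in range(l + m):
        Az = matvec(Z[j])
        if j < l:                            # start-up phase: z_{j+1} = (A - sigma_j I) z_j
            Z.append([az - shifts[j] * z for az, z in zip(Az, Z[j])])
        else:                                # steady state: Lanczos recurrence shifted by l
            k = j - l
            w = [az - gam[k] * z for az, z in zip(Az, Z[j])]
            if k > 0:                        # delta_{-1} = 0
                w = [wk - dlt[k - 1] * z for wk, z in zip(w, Z[j - 1])]
            Z.append([wk / dlt[k] for wk in w])
    return V, Z
```

In this sketch the $z$ vectors are built from $v$-coefficients that are already known; in the pipelined algorithm those coefficients only become available $l$ iterations after the corresponding dot products were started.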

The following recurrence relation for $v_{j+1}$ is derived from the basis transformation (with $0\leq j<i-l$ ):

$$v_{j+1} = \Big( z_{j+1} - \sum_{k=j-2l+1}^{j} g_{k,j+1} v_{k} \Big) / g_{j+1,j+1}. \tag{9}$$

A total of $l$ iterations after the dot-products

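$$g_{j,i+1} = \begin{cases} (z_{i+1}, v_{j}), & j = \max(0, i-2l+1), \ldots, i-l+1, \\ (z_{i+1}, z_{j}), & j = i-l+2, \ldots, i+1, \end{cases} \tag{10}$$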

have been initiated, the elements $g_{j,i-l+1}$ with $i-2l+2\leq j\leq i-l+1$ , which were computed as $(z_{i-l+1},z_{j})$ , are corrected as follows:

$$g_{j,i-l+1} = \frac{g_{j,i-l+1} - \sum_{k=i-3l+1}^{j-1} g_{k,j} g_{k,i-l+1}}{g_{j,j}}, \tag{11}$$

for $j=i-2l+2,\ldots,i-l$ , and:

$$g_{i-l+1,i-l+1} = \sqrt{g_{i-l+1,i-l+1} - \sum_{k=i-3l+1}^{i-l} g_{k,i-l+1}^{2}}. \tag{12}$$

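Additionally, in the $i$ -th iteration the tridiagonal matrix $T_{i-l+2,i-l+1}$ , see (1), can be updated recursively by adding one column. The diagonal element $\gamma_{i-l}$ is characterized by the expressions:

$$\gamma_{i-l} = \begin{cases} \dfrac{g_{i-l,i-l+1} + \sigma_{i-l} g_{i-l,i-l} - g_{i-l-1,i-l} \delta_{i-l-1}}{g_{i-l,i-l}}, & l \leq i < 2l, \\[2ex] \dfrac{g_{i-l,i-l} \gamma_{i-2l} + g_{i-l,i-l+1} \delta_{i-2l} - g_{i-l-1,i-l} \delta_{i-l-1}}{g_{i-l,i-l}}, & i \geq 2l. \end{cases} \tag{13}$$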

The term $-g_{i-l-1,i-l}\delta_{i-l-1}$ is considered zero when $i=l$ . The update for the off-diagonal element $\delta_{i-l}$ is given by

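$$\delta_{i-l} = \begin{cases} g_{i-l+1,i-l+1} / g_{i-l,i-l}, & l \leq i < 2l, \\ (g_{i-l+1,i-l+1} \delta_{i-2l}) / g_{i-l,i-l}, & i \geq 2l. \end{cases} \tag{14}$$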

The element $\delta_{i-l-1}$ has already been computed in the previous iteration and can thus simply be copied due to the symmetry of $T_{i-l+2,i-l+1}$ .

Once the basis $V_{i-l+1}$ has been constructed, the solution $x_{i-l}$ can be updated based on a search direction $p_{i-l}$ , following the classic derivation of D-Lanczos in [35], Sec. 6.7.1. The Ritz-Galerkin condition

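$$0 = V_{m}^{T} r_{m} = V_{m}^{T} (r_{0} - A V_{m} y_{m}) = V_{m}^{T} (V_{m} \|r_{0}\|_{2} e_{1}) - T_{m} y_{m} = \|r_{0}\|_{2} e_{1} - T_{m} y_{m} \tag{15}$$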

implies $y_{m}=T_{m}^{-1}{\|r_{0}\|}_{2}e_{1}$ . The LU-factorization of the tridiagonal matrix $T_{i-l+1}=L_{i-l+1}U_{i-l+1}$ is given by

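$$T_{i-l+1} = \begin{pmatrix} 1 & & & \\ \lambda_{1} & 1 & & \\ & \ddots & \ddots & \\ & & \lambda_{i-l} & 1 \end{pmatrix} \begin{pmatrix} \eta_{0} & \delta_{0} & & \\ & \eta_{1} & \ddots & \\ & & \ddots & \delta_{i-l-1} \\ & & & \eta_{i-l} \end{pmatrix}. \tag{16}$$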

Note that $\gamma_{0}=\eta_{0}$ . It follows from (16) that the elements of the lower/upper triangular matrices $L_{i-l+1}$ and $U_{i-l+1}$ are given by (with $1\leq j\leq i-l$ )

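$$\lambda_{j} = \delta_{j-1} / \eta_{j-1} \quad \text{and} \quad \eta_{j} = \gamma_{j} - \lambda_{j} \delta_{j-1}. \tag{17}$$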

Expression (15) indicates that the approximate solution $x_{i-l+1}$ equals

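$$x_{i-l+1} = x_{0} + V_{i-l+1} U_{i-l+1}^{-1} L_{i-l+1}^{-1} \|r_{0}\|_{2} e_{1} = x_{0} + P_{i-l+1} q_{i-l+1}, \tag{18}$$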

where $P_{i-l+1}=V_{i-l+1}U_{i-l+1}^{-1}$ and $q_{i-l+1}=L_{i-l+1}^{-1}\|r_{0}\|_{2}e_{1}$ . Note that $p_{0}=v_{0}/\eta_{0}$ . The columns $p_{j}$ (for $1\leq j\leq i-l$ ) of $P_{i-l+1}$ can be computed recursively. From $P_{i-l+1}U_{i-l+1}=V_{i-l+1}$ it follows

$$p_{j} = \eta_{j}^{-1} (v_{j} - \delta_{j-1} p_{j-1}), \qquad 1 \leq j \leq i-l. \tag{19}$$

Denoting the vector $q_{i-l+1}$ by $\left[\zeta_{0},\ldots,\zeta_{i-l}\right]^{T}$ , it follows from $L_{i-l+1}q_{i-l+1}=\|r_{0}\|_{2}e_{1}$ that $\zeta_{0}={\|r_{0}\|}_{2}$ and $\zeta_{j}=-\lambda_{j}\zeta_{j-1}$ for $1\leq j\leq i-l$ . Using the search direction $p_{i-l-1}$ and the scalar $\zeta_{i-l-1}$ , the approximate solution $x_{i-l}$ is updated using the recurrence relation:

$$x_{i-l} = x_{i-l-1} + \zeta_{i-l-1} p_{i-l-1}. \tag{20}$$

The above expressions are combined in Alg. 1. Once the initial pipeline for $z_{0},\ldots,z_{l}$ has been filled, the relations (6)-(9) are used to recursively compute the basis vectors $v_{i-l+1}$ and $z_{i+1}$ in iterations $i\geq l$ (see lines 15-16). The scalar results of the global reduction phase (line 18) are required $l$ iterations later (lines 5-6). In every iteration global communication is thus overlapped with the computational work of $l$ subsequent iterations, forming the heart of the communication hiding p( $l$ )-CG algorithm.

Remark 1.

Residual norm in p( $l$ )-CG. Note that the residual $r_{j}=b-Ax_{j}$ is not computed in Alg. 1, but its norm is characterized by the quantity $|\zeta_{j}|=\|r_{j}\|$ for $0\leq j\leq i-l$ . This quantity can be used to formulate a stopping criterion for the p( $l$ )-CG iteration, see Alg. 1 line 27.
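The solution update and the residual-norm property of Remark 1 can be illustrated with a non-pipelined D-Lanczos iteration. The following is a minimal sketch, assuming $x_{0}=0$ and a 1D Laplacian test operator (`laplacian_matvec` and `d_lanczos` are hypothetical names introduced here); it is not Alg. 1, only the underlying Lanczos, LU and search-direction recurrences.

```python
import math

def laplacian_matvec(v):
    """1D Laplacian stencil [-1, 2, -1], an SPD test operator (illustrative choice)."""
    n = len(v)
    return [2.0 * v[k] - (v[k - 1] if k > 0 else 0.0) - (v[k + 1] if k < n - 1 else 0.0)
            for k in range(n)]

def dot(x, y):
    return sum(a * b for a, b in zip(x, y))

def d_lanczos(matvec, b, iters):
    """Direct Lanczos for SPD A with x_0 = 0; returns x and the history of |zeta_j|."""
    n = len(b)
    x = [0.0] * n
    zeta = math.sqrt(dot(b, b))              # zeta_0 = ||r_0||_2 with r_0 = b
    v = [bk / zeta for bk in b]              # v_0
    v_old, p = [0.0] * n, [0.0] * n
    delta_old, eta_old = 0.0, 1.0            # delta_{-1} = 0; eta_{-1} is a dummy value
    zetas = [abs(zeta)]
    for j in range(iters):
        w = matvec(v)
        gamma = dot(w, v)
        w = [wk - gamma * vk - delta_old * uk for wk, vk, uk in zip(w, v, v_old)]
        delta = math.sqrt(dot(w, w))
        lam = delta_old / eta_old            # lambda_j = delta_{j-1} / eta_{j-1}
        eta = gamma - lam * delta_old        # eta_j = gamma_j - lambda_j delta_{j-1}
        if j > 0:
            zeta = -lam * zeta               # zeta_j = -lambda_j zeta_{j-1}
        p = [(vk - delta_old * pk) / eta for vk, pk in zip(v, p)]   # search direction p_j
        x = [xk + zeta * pk for xk, pk in zip(x, p)]                # x_{j+1} = x_j + zeta_j p_j
        zetas.append(abs(zeta) * delta / eta)    # |zeta_{j+1}|, the norm of b - A x_{j+1}
        if delta == 0.0:                     # invariant subspace reached
            break
        v_old, v = v, [wk / delta for wk in w]
        delta_old, eta_old = delta, eta
    return x, zetas
```

The last entry of `zetas` can be compared against the explicitly computed $\|b-Ax\|$ to confirm that the residual norm is available without forming the residual vector.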

Remark 2.

Dot products in p( $l$ )-CG. Although Alg. 1, line 18 indicates that in each iteration $i\geq(2l+1)$ a total of $(2l+1)$ dot products need to be computed, the number of dot product computations can be limited to $(l+1)$ by exploiting the symmetry of the matrix $G_{i+1}$ , see expression (8). Since $g_{j,i+1}=g_{i-l+1,j+l}$ for $j\leq i+1$ , only the dot products $(z_{j},z_{i+1})$ for $j=i-l+2,\ldots,i+1$ and the $l$ -th upper diagonal element $(v_{i-l+1},z_{i+1})$ need to be computed in iteration $i$ .

2.2 On the conditioning of the auxiliary basis

As $V_{i}$ is an orthonormal basis, the transformation $Z_{i}=V_{i}G_{i}$ can be interpreted as a QR factorization of the auxiliary basis $Z_{i}$ . Moreover, it holds that $G_{i}^{T}G_{i}$ is the Cholesky factorization of $Z_{i}^{T}Z_{i}$ , since

$$Z_{i}^{T} Z_{i} = G_{i}^{T} V_{i}^{T} V_{i} G_{i} = G_{i}^{T} G_{i}. \tag{21}$$

The elements of the transformation matrix $G_{i}$ are computed on lines 5-6 of Alg. 1 precisely by means of this Cholesky factorization. This observation leads to the following two essential insights related to the conditioning of the basis $Z_{i}$ .

Remark 3.

Square root breakdowns in p( $l$ )-CG. The auxiliary basis vectors $z_{j}$ are defined as $P_{l}(A)v_{j-l}$ , but the basis $Z_{i}$ is in general not orthogonal. Hence, vectors $z_{j}\in Z_{i}$ are not necessarily linearly independent. In particular for longer $l$ , different $z_{j}$ vectors are expected to become more and more aligned. This leads to $Z_{i}^{T}Z_{i}$ being ill-conditioned, approaching singularity as $i$ increases. The effect is the most pronounced when $\sigma_{0}=\ldots=\sigma_{l-1}=0$ , in which case $P_{l}(A)=A^{l}$ . Shifts $\sigma_{j}$ can be set to improve the conditioning of $Z_{i}^{T}Z_{i}$ , see also Remark 4.

When for certain $i$ the matrix $Z_{i}^{T}Z_{i}$ becomes (numerically) singular, the Cholesky factorization procedure in p( $l$ )-CG will fail. This may manifest in the form of a square root breakdown on line 7 in Alg. 1 when the root argument $g_{i-l+1,i-l+1}-\sum_{k=i-3l+1}^{i-l}g_{k,i-l+1}^{2}$ becomes negative. Numerical round-off errors may increase the occurrence of these breakdowns in practice. When a breakdown occurs in p( $l$ )-CG the iteration is restarted, in analogy to the GMRES algorithm, although it should be noted that the nature of the breakdown in both algorithms is quite different. Evidently, the restarting strategy may delay convergence compared to standard CG, where no square root breakdowns occur.
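The breakdown mechanism can be made concrete with a small dense Cholesky routine. The sketch below is an assumption-laden illustration (full dense matrix, hypothetical function name `cholesky_upper`), not the banded in-place update of Alg. 1; it computes an upper triangular $G$ with $G^{T}G=M$ and raises an error when the root argument is not positive.

```python
import math

def cholesky_upper(M):
    """Return upper triangular G with G^T G = M, or raise ValueError on a square root breakdown."""
    n = len(M)
    G = [[0.0] * n for _ in range(n)]
    for c in range(n):                      # build column c of G
        for r in range(c):                  # off-diagonal entries: subtract known products
            s = sum(G[k][r] * G[k][c] for k in range(r))
            G[r][c] = (M[r][c] - s) / G[r][r]
        arg = M[c][c] - sum(G[k][c] ** 2 for k in range(c))   # root argument
        if arg <= 0.0:
            raise ValueError(f"square root breakdown in column {c}: argument {arg:.3e}")
        G[c][c] = math.sqrt(arg)
    return G
```

For example, the Gram matrix of the nearly aligned vectors $(1,0)$ and $(1,10^{-9})$ is `[[1.0, 1.0], [1.0, 1.0 + 1e-18]]`, which in double precision is exactly singular, so the routine reports a breakdown in the second column.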

Remark 4.

Choice of the auxiliary basis and relation to the shifts. It follows from the Cholesky factorization (21) that the inverse of the transformation matrix $G_{i}$ is

$$G_{i}^{-1} = (Z_{i}^{T} Z_{i})^{-1} G_{i}^{T}. \tag{22}$$

The conditioning of the matrix $G_{i}^{-1}$ is thus determined by the conditioning of $Z_{i}^{T}Z_{i}$ . Furthermore, it holds that

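$$Z_{l+1:i}^{T} Z_{l+1:i} = (P_{l}(A) V_{i-l})^{T} P_{l}(A) V_{i-l} = V_{i-l}^{T} P_{l}(A)^{2} V_{i-l}, \tag{23}$$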

where $Z_{l+1:i}$ is a part of the basis $Z_{i}$ obtained by dropping the first $l$ columns. Hence, the polynomial $P_{l}(A)^{2}$ has a major impact on the conditioning of the matrix $G_{i}^{-1}$ , which in turn plays a crucial role in the propagation of local rounding errors in the p( $l$ )-CG algorithm [11], see Section 3.1 of the current work. This observation indicates intuitively why $\|P_{l}(A)\|_{2}$ should preferably be as small as possible, which can be achieved by choosing appropriate values for the shifts $\sigma_{j}$ . Optimal shift values in the sense of minimizing the Euclidean $2$ -norm of $P_{l}(A)$ are the Chebyshev shifts [27, 18, 11] (for $i=0,\ldots,l-1$ ):

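$$\sigma_{i} = \frac{\lambda_{\max} + \lambda_{\min}}{2} + \frac{\lambda_{\max} - \lambda_{\min}}{2} \cos\left(\frac{(2i+1)\pi}{2l}\right), \tag{24}$$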

which are used throughout this work. This requires a notion of the largest (smallest) eigenvalue $\lambda_{\max}$ (resp. $\lambda_{\min}$ ), which can be estimated a priori, e.g. by a few power method iterations.
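To see the effect of the Chebyshev shifts, the sketch below computes them for a given spectral interval and estimates $\max_{t}|P_{l}(t)|$ on that interval by sampling; for symmetric $A$ with spectrum inside the interval this maximum bounds $\|P_{l}(A)\|_{2}$ . The function names, the sample count and the interval in the usage note are assumptions made for illustration.

```python
import math

def chebyshev_shifts(lam_min, lam_max, l):
    """Shifts sigma_0..sigma_{l-1}: Chebyshev nodes mapped to [lam_min, lam_max]."""
    mid, half = (lam_max + lam_min) / 2.0, (lam_max - lam_min) / 2.0
    return [mid + half * math.cos((2 * i + 1) * math.pi / (2 * l)) for i in range(l)]

def poly_abs_max(shifts, lam_min, lam_max, samples=2001):
    """Estimate max |P_l(t)| on [lam_min, lam_max] by sampling, P_l(t) = prod_j (t - sigma_j)."""
    best = 0.0
    for s in range(samples):
        t = lam_min + (lam_max - lam_min) * s / (samples - 1)
        best = max(best, abs(math.prod(t - sig for sig in shifts)))
    return best
```

On the interval $[1,9]$ with $l=4$ , for instance, the shifted polynomial stays bounded by $2\,((9-1)/4)^{4}=32$ , whereas the unshifted choice $P_{4}(t)=t^{4}$ reaches $9^{4}=6561$ .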

2.3 Loss of orthogonality and attainable accuracy

This work is motivated by the observation that two main issues affect the convergence of pipelined CG methods: loss of basis vector orthogonality and inexact Lanczos relations. We comment briefly on both issues from a high-level point of view and clearly mark the scope of this work. Important insights about the similarities and differences between classic CG and p( $l$ )-CG are highlighted before going into more details on the numerics in Sections 3 and 4.

2.3.1 Loss of orthogonality

It is well-known that in finite precision arithmetic the orthogonality of the Krylov subspace basis $V_{i}$ , i.e. $V^{T}_{i}V_{i}=I_{i}$ (identity matrix) may not hold exactly. Inexact orthogonality may appear in every variant of the CG algorithm, see [29], in particular in the D-Lanczos algorithm [35], where a new basis vector is constructed by orthogonalizing with respect to the previous two basis vectors, as well as in the related p( $l$ )-CG method, Alg. 1. (Note: the D-Lanczos, short for “direct Lanczos”, algorithm is a variant of the CG method that is equivalent to the latter in exact arithmetic, save for the solution of the system $T_{i}y_{i}=\|r_{0}\|e_{1}$ which is computed by using Gaussian elimination in D-Lanczos. The D-Lanczos method is the basic Krylov subspace method from which the p( $l$ )-CG method was derived, see [11], Section 2, for details.) Loss of orthogonality typically leads to delay of convergence, meaning the residual deviates from the one in the scenario in which orthogonality would not be lost.

We use a notation with bars to designate variables that are actually computed in a finite precision implementation of the algorithm. The key relation for the Conjugate Gradient method is the Ritz-Galerkin condition:

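$$\bar{V}_{i}^{T} (b - A \bar{x}_{i}) = \bar{V}_{i}^{T} \bar{v}_{0} \|\bar{r}_{0}\| - \bar{V}_{i}^{T} A \bar{V}_{i} \bar{T}_{i}^{-1} \|\bar{r}_{0}\| e_{1} = 0. \tag{25}$$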

This equality only holds under the assumption that $\bar{V}_{i}^{T}A\bar{V}_{i}=\bar{T}_{i}$ which requires $\bar{V}_{i}^{T}\bar{V}_{i+1}=I_{i,i+1}$ . Note that in finite precision arithmetic the convergence delay can be observed in both the actual residual norm $\|b-A\bar{x}_{i}\|$ as well as the recursively computed residual norm $\|\bar{r}_{i}\|$ , since both quantities are based on the (possibly non-orthogonal) basis $\bar{V}_{i+1}$ , see Fig. 1(a)-2(a) (discussed further in Section 2.3.3).

2.3.2 The inexact Lanczos relation

The basis vectors in the pipelined CG algorithm are not computed explicitly using the Lanczos relation (1). Rather, they are computed recursively, see (9), to avoid the computation of additional spmvs. In finite precision, local rounding errors in the recurrence relation may contaminate the basis $\bar{V}_{i}$ , such that the Lanczos relation $A\bar{V}_{i}-\bar{V}_{i+1}\bar{T}_{i+1,i}=0$ is no longer valid. Moreover, due to propagation of local rounding errors, $\|A\bar{V}_{i}-\bar{V}_{i+1}\bar{T}_{i+1,i}\|$ may grow dramatically as the iteration proceeds. Using the classic model for floating point arithmetic with machine precision $\epsilon$ [33, 22, 21, 45], the round-off error on basic computations on the matrix $A\in\mathbb{R}^{n\times n}$ , vectors $v$ , $w$ and a scalar $\alpha$ are bounded by

[TABLE]
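The rounding model behind such bounds can be checked exactly with rational arithmetic. The sketch below assumes IEEE double precision with round-to-nearest, so that each product satisfies $|\alpha v_{k}-\text{fl}(\alpha v_{k})|\leq\epsilon\,|\alpha v_{k}|$ with $\epsilon=2^{-53}$ (this value of $\epsilon$ and the helper name are assumptions for the example); it evaluates the squared norms of the error and of the exact result as exact fractions for the scaling operation $\alpha v$ .

```python
from fractions import Fraction

EPS = Fraction(1, 2 ** 53)   # unit roundoff of IEEE double precision (assumed rounding mode: nearest)

def scaling_error_sq(alpha, v):
    """Exact squared norms ||alpha v - fl(alpha v)||^2 and ||alpha v||^2 as rationals."""
    err_sq, ref_sq = Fraction(0), Fraction(0)
    for vk in v:
        exact = Fraction(alpha) * Fraction(vk)     # exact product of the two doubles
        computed = Fraction(alpha * vk)            # rounded product fl(alpha * v_k)
        err_sq += (exact - computed) ** 2
        ref_sq += exact ** 2
    return err_sq, ref_sq
```

Comparing `err_sq` with `EPS ** 2 * ref_sq` confirms the componentwise model in norm form, provided no underflow or overflow occurs.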

Here $\text{fl}(\cdot)$ indicates the finite precision floating point representation, $\mu$ is the maximum number of nonzeros in any row of $A$ , and the norm $\|\cdot\|$ represents the Euclidean 2-norm. Under this model the recurrence relations for $\bar{x}_{i}$ and $\bar{p}_{i}$ in a finite precision implementation of p( $l$ )-CG are

[TABLE]

with expression (27) translating in matrix notation to

[TABLE]

Recall that in exact arithmetic $P_{i}=[p_{0},\ldots,p_{i-1}]=V_{i}U^{-1}_{i}$ . In these expressions $\Xi_{i}^{\bar{x}}=[\xi^{\bar{x}}_{1},\ldots,\xi^{\bar{x}}_{i}]$ and $\Xi^{\bar{p}}_{i}=-[\bar{\eta}_{0}\xi^{\bar{p}}_{0},\ldots,\bar{\eta}_{i-1}\xi^{\bar{p}}_{i-1}]$ are local rounding errors which are bounded by $\|\xi^{\bar{x}}_{i}\|\leq(\|\bar{x}_{i-1}\|+2\,|\bar{\zeta}_{i-1}|\,\|\bar{p}_{i-1}\|)\,\epsilon$ and $\|\xi^{\bar{p}}_{i}\|\leq(2/\bar{\eta}_{i}\,\|\bar{v}_{i-1}\|+3\,|\bar{\delta}_{i-1}|/\bar{\eta}_{i}\,\|\bar{p}_{i-1}\|)\,\epsilon$ , and $\mathbb{1}=[1,1,\ldots,1]^{T}$ . The actual residual satisfies the following relations in a finite precision setting:

[TABLE]

where $\xi^{\bar{r}}_{0}=(b-A\bar{x}_{0})-\bar{r}_{0}$ . The recursively computed residual $\bar{r}_{i}=-\bar{\delta}_{i-1}(e_{i}^{T}\bar{U}_{i}^{-1}\bar{q}_{i})\bar{v}_{i}=\bar{\zeta}_{i}\bar{v}_{i}$ that appears in expression (29) tends to zero. The actual residual norm $\|b-A\bar{x}_{i}\|$ , on the other hand, stagnates around $\|(A\bar{V}_{i}-\bar{V}_{i+1}\bar{T}_{i+1,i})\bar{U}_{i}^{-1}\bar{q}_{i}\|$ , a quantity referred to as the maximal attainable accuracy of the method. The difference between the norm of the actual residual and the recursively computed residual is illustrated in Fig. 1(a). A detailed analysis of the deviation from the Lanczos relation in finite precision (“inexact Lanczos”) can be found in Section 3.

2.3.3 Scope and limitations of this manuscript

The issue of inexact Lanczos relations in p( $l$ )-CG is timely and deserving of attention. Loss of accuracy resulting from the inexact Lanczos relation has long been a limiting factor in applying p( $l$ )-CG and related algorithms in practice. Fig. 1(a) illustrates how the norms of the actual residuals $\|b-A\bar{x}_{i}\|$ stagnate while the recursively computed residual norms $\|\bar{r}_{i}\|$ continue to decrease. For p( $l$ )-CG local rounding errors in the recurrence relations are propagated leading to reduced attainable accuracy compared to D-Lanczos. Moreover, while loss of orthogonality also warrants further investigation, this issue is not exclusive to pipelined methods. Delayed convergence is observed in classic Krylov subspace methods also, see Fig. 1, whereas loss of attainable accuracy is not. The issue could be addressed by e.g. re-orthogonalizing the basis, see [42]. However, communication-reducing methods are not particularly suitable to include re-orthogonalization, since this introduces additional global reduction phases.

Although loss of orthogonality does not originate from applying the pipelining technique, it may be more pronounced for pipelined methods compared to their classic counterparts, see Fig. 1(b). However, Fig. 1(a) indicates that the effect of the inexact Lanczos relation on convergence is much more dramatic than the effect of inexact orthogonality for all pipeline lengths $l$ . This manuscript thus focuses on improving the numerical stability of the p( $l$ )-CG method by neutralizing the propagation of local rounding errors in the recursively computed basis vector updates. As such, this work proposes a key step towards a numerically stable communication hiding variant of the CG method.

3 Analyzing rounding error propagation

This section recaps the analysis of local rounding errors that stem from the recurrence relations in the pipelined p( $l$ )-CG method, Alg. 1. It aims to precisely explain the source of the loss of accuracy observed for the p( $l$ )-CG method. The methodology for the analysis is similar to the one used in classic works by Paige [33, 34], Greenbaum [20, 22, 21], Gutknecht [24], Strakos [41], Meurant [32], Sleijpen [38, 39], Van der Vorst [44], Higham [26], and others.

Finite precision variants of the exact scalar and vector variables introduced in Section 2 are denoted by a bar symbol in this section. We differentiate between “actual” vector variables, which satisfy the Lanczos relations exactly but are not computed in the algorithm, and “recursively computed” variables, which contain machine-precision sized round-off errors related to finite precision computations.

3.1 Local rounding error behavior in finite precision

For any $j\geq 0$ the true basis vector, denoted by $\bar{\mathbb{v}}_{j+1}$ , satisfies the Lanczos relation (3) exactly, that is, for $0\leq j<i-l$ :

[TABLE]

without the addition of a local rounding error. This vector is not actually computed in the p( $l$ )-CG algorithm. Instead, the computed basis vector $\bar{v}_{j+1}$ is calculated from the finite precision variant of relation (9) for $0\leq j<i-l$ , i.e.:

[TABLE]

where the size of the local rounding errors is bounded in terms of machine precision: $\|\xi^{\bar{v}}_{j+1}\|\leq(2\,\|\bar{z}_{j+1}\|/|\bar{g}_{j+1,j+1}|+3\sum_{k=j-2l+1}^{j}|\bar{g}_{k,j+1}|/|\bar{g}_{j+1,j+1}|\,\|\bar{v}_{k}\|)\epsilon$ . Let $\bar{V}_{j+1}=[\bar{v}_{0},\bar{v}_{1},\ldots,\bar{v}_{j}]$ and $\bar{\mathbb{V}}_{j+1}=[\bar{\mathbb{v}}_{0},\bar{\mathbb{v}}_{1},\ldots,\bar{\mathbb{v}}_{j}]$ . Relation (30) alternatively translates to the following formulation in matrix notation (with $1\leq j\leq i-l$ ):

[TABLE]

where $\bar{\Delta}_{j+1,j}$ is a $(j+1)$ -by- $j$ rectangular matrix holding the entries $\{\bar{\delta}_{0},\ldots,\bar{\delta}_{j-1}\}$ directly below the main diagonal. The matrix $(\bar{\mathbb{V}}_{j+1}-\bar{V}_{j+1})=[\bar{\mathbb{v}}_{0}-\bar{v}_{0},\bar{\mathbb{v}}_{1}-\bar{v}_{1},\ldots,\bar{\mathbb{v}}_{j}-\bar{v}_{j}]$ collects the gaps between the actual and recursively computed basis vectors, which quantify the deviation from the Lanczos relation in the finite precision setting. These gaps are crucial in describing the propagation of local rounding errors throughout the p( $l$ )-CG algorithm and are directly linked to the gap between the actual and recursively computed residuals, see expression (2.3.2). From (31) one obtains

[TABLE]

where $\Xi^{\bar{v}}_{j}=[0,\,\bar{g}_{1,1}\xi^{\bar{v}}_{1},\,\ldots,\,\bar{g}_{j-1,j-1}\xi^{\bar{v}}_{j-1}]$ collects the local rounding errors. The computed auxiliary vector $\bar{z}_{j+1}$ satisfies a finite precision version of the recurrence relation (6), which summarizes to

[TABLE]

where $\xi^{\bar{z}}_{j+1}$ is the local rounding error which can again be bounded in terms of machine precision $\epsilon$ . Expression (34) can be formulated in matrix notation as:

[TABLE]

Furthermore, the following matrix relations hold between the scalar coefficients $\bar{\gamma}_{j}$ and $\bar{\delta}_{j}$ in Alg. 1:

[TABLE]

Subsequently, using expressions (32), (33), (35) and (36), it is derived that the gaps on the basis vectors are given by

[TABLE]

where $\bar{\Delta}_{j+1,j}^{+}=(\bar{\Delta}_{j+1,j}^{*}\bar{\Delta}_{j+1,j})^{-1}\bar{\Delta}_{j+1,j}^{*}$ should be interpreted as a Moore-Penrose (left) pseudo-inverse. Hence, the local rounding errors in this expression are possibly amplified by the entries of the matrix $\bar{G}^{-1}_{j}\bar{\Delta}^{+}_{j+1,j}$ , which may lead to loss of attainable accuracy for the p( $l$ )-CG method. The inexact Lanczos relation may in turn give rise to a growing gap between the computed and actual residual on the solution, see (2.3.2). It is clear from expression (37) that the conditioning of the matrix $\bar{G}^{-1}_{j}$ plays a crucial role in the rounding error propagation in the p( $l$ )-CG algorithm as indicated in Section 2.2, see Remark 4.
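As a small worked illustration, assume real, nonzero entries $\bar{\delta}_{k}$ and take $j=3$ . Because $\bar{\Delta}_{j+1,j}$ holds nonzero entries only directly below the main diagonal, its left pseudo-inverse has a closed form:

$$
\bar{\Delta}_{4,3}=\begin{bmatrix}0&0&0\\ \bar{\delta}_{0}&0&0\\ 0&\bar{\delta}_{1}&0\\ 0&0&\bar{\delta}_{2}\end{bmatrix},\qquad
\bar{\Delta}_{4,3}^{*}\bar{\Delta}_{4,3}=\begin{bmatrix}\bar{\delta}_{0}^{2}&0&0\\ 0&\bar{\delta}_{1}^{2}&0\\ 0&0&\bar{\delta}_{2}^{2}\end{bmatrix},\qquad
\bar{\Delta}_{4,3}^{+}=\begin{bmatrix}0&1/\bar{\delta}_{0}&0&0\\ 0&0&1/\bar{\delta}_{1}&0\\ 0&0&0&1/\bar{\delta}_{2}\end{bmatrix}.
$$

One checks directly that $\bar{\Delta}_{4,3}^{+}\bar{\Delta}_{4,3}=I_{3}$ . The nonzero entries of the pseudo-inverse are the reciprocals $1/\bar{\delta}_{k}$ , so small off-diagonal Lanczos coefficients, combined with a large $\|\bar{G}^{-1}_{j}\|$ , are what can make the amplification factors in the gap expression large.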

3.2 Toward stability by using the Lanczos relation

Section 3.1 shows that the recurrence relation (31) is the main cause of the amplification of local rounding errors throughout the p( $l$ )-CG algorithm. Moreover, the possibly ill-conditioned matrix $\bar{G}_{j}$ that is used to construct the basis $\bar{V}_{j}$ , see expression (33), may be detrimental to convergence. A straightforward countermeasure would be to eliminate $\bar{G}_{j}$ in the construction of the basis. This can be achieved by simply replacing the recurrence relation (31) by the original Lanczos relation, i.e., for $0\leq j<i-l$ :

[TABLE]

Here $\psi^{\bar{v}}_{j+1}$ represents a local rounding error which is generally different from the error $\xi^{\bar{v}}_{j+1}$ occurring in expression (31). Recurrence relation (38) can alternatively be written as

[TABLE]

with $1\leq j\leq i-l$ . The gap between the true basis vector $\bar{\mathbb{v}}_{j+1}$ and the computed basis vector $\bar{v}_{j+1}$ then reduces to

[TABLE]

By using recurrence (38) for $\bar{v}_{j+1}$ instead of relation (31) in Alg. 1, no amplification of local rounding errors occurs, see (40), and the influence of rounding errors on attainable accuracy remains limited. However, to use the recurrence relation (38) an additional spmv, i.e. $A\bar{v}_{j}$ , is computed in each iteration of the algorithm, leading to an undesirable increase in computational cost. Although the use of expression (38) would not hinder the ability to overlap the global reduction phase with computations (for pipeline length $l$ the global reduction would simply be overlapped with $2l$ spmvs), we aim to avoid adding spmv computations to the algorithm.

The technique proposed by expression (38) shows similarity to the concept of residual replacement, which was suggested by several authors as a countermeasure to local rounding error propagation in various multi-term recurrence variants of CG [44, 3, 10]. While the idea is valuable, it cannot be implemented in the p( $l$ )-CG method in its current form, i.e. using expression (38), due to the significantly augmented computational cost caused by the additional spmv in each iteration.

4 Deriving stable recurrence relations

We now present the core technique for obtaining a numerically stable variant of the recurrence relations used in the p( $l$ )-CG algorithm by introducing additional auxiliary bases and corresponding recurrence relations. Sections 4.1-4.3 are again written in the exact arithmetic framework in order to derive the algorithm, with a short discussion on computational costs and storage requirements in Section 4.2. We return to the finite precision framework for the analysis of the improved method in Section 4.4.

4.1 Derivation of a stable pipelined CG method

We introduce a total of $l+1$ bases, denoted by $Z^{(k)}_{i+1}$ , where the upper index ‘ $(k)$ ’ (with $0\leq k\leq l$ ) labels the different bases and the lower index ‘ $i+1$ ’ indicates the iteration like before. The basis $Z^{(0)}_{i+1}$ will denote the original Krylov subspace basis, that is: $Z^{(0)}_{i+1}=V_{i+1}$ . The auxiliary basis vectors $Z^{(l)}_{i+1}=[z^{(l)}_{0},z^{(l)}_{1},\ldots,z^{(l)}_{i}]$ are defined identically to the basis $Z_{i+1}$ in p( $l$ )-CG, cf. (4), i.e. $Z^{(l)}_{i+1}=Z_{i+1}$ . In addition, we also define $l-1$ intermediary bases $Z^{(1)}_{i+1},\ldots,Z^{(l-1)}_{i+1}$ that will enable us to use a variant of the Lanczos relation (38) to recursively update $v_{j}$ , but without the necessity to compute the spmv $Av_{j}$ . The auxiliary bases are defined as follows:

[TABLE]

where the polynomial is defined by (5). Note that the first $k+1$ vectors in basis $Z^{(k)}_{j}$ $(j\geq k)$ are identical to the first $k+1$ vectors in all bases $Z^{(k^{\prime})}_{j}$ with $k^{\prime}\geq k$ . By definition (41) the ‘zero-th’ basis $Z^{(0)}_{j}$ is simply the original basis $V_{j}$ , whereas the $l$ -th basis $Z^{(l)}_{j}$ is the auxiliary basis $Z_{j}$ from the p( $l$ )-CG method, see Section 2.

A crucial relation connects each pair of bases $Z^{(k)}_{j}$ and $Z^{(k+1)}_{j}$ (for $j>k$ ). It holds that

[TABLE]

which translates into

[TABLE]

By multiplying the original Lanczos relation (3) for $v_{j}$ on both sides by the respective polynomial $P_{k}(A)$ with $1\leq k\leq l$ and by exploiting the commutativity of $A$ and $P_{k}(A)$ , it is straightforward to derive that each auxiliary basis $Z^{(k)}_{j}$ with $0\leq k\leq l$ satisfies a Lanczos type recurrence relation:

[TABLE]

for $j\geq k$ and $0\leq k\leq l$ . Note that when $k=0$ expression (44) yields the Lanczos relation (3) for $v_{j+1}$ , whereas setting $k=l$ results in the recurrence relation (6) for $z_{j+1}$ .

The recursive expressions (44) for the bases $Z^{(0)}_{j},\ldots,Z^{(l-1)}_{j}$ are not particularly useful in practice, since each recurrence relation requires computing an additional spmv to form the next basis vector. However, using relation (43) the recurrence relations (44) can alternatively be written as:

[TABLE]

with $j\geq k$ and $0\leq k<l$ . We stress that only for $k=l$ , i.e. to compute the vectors in the auxiliary basis $Z^{(l)}_{j}=Z_{j}$ , we use the recursive update given by expression (44):

[TABLE]

for $j\geq l$ , which reduces to the recurrence relation (6). The recurrence relations (45) allow us to compute the vector updates for the bases $Z^{(0)}_{j},\ldots,Z^{(l-1)}_{j}$ without the need to compute any additional spmv. Adding the recurrence relations (45) for the auxiliary bases $Z^{(0)}_{j},\ldots,Z^{(l-1)}_{j}$ to the p( $l$ )-CG method leads to the stable p( $l$ )-CG method, Alg. 2.
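The spmv-free update can be checked numerically with a short sketch. The display equations are not reproduced in this section, so the update below is a reconstruction from the surrounding text and should be read as an assumption: $z^{(k)}_{j+1}=(z^{(k+1)}_{j+1}+(\sigma_{k}-\gamma_{j-k})z^{(k)}_{j}-\delta_{j-k-1}z^{(k)}_{j-1})/\delta_{j-k}$ , with $z^{(k)}_{j}=P_{j}(A)v_{0}$ for $j\leq k$ and $z^{(k)}_{j}=P_{k}(A)v_{j-k}$ otherwise. The test matrix, the shifts and all function names are illustrative choices, and the Lanczos coefficients come from a textbook Lanczos loop rather than from Alg. 2.

```python
import math

def matvec(A, x):
    return [sum(a * xi for a, xi in zip(row, x)) for row in A]

def dot(x, y):
    return sum(a * b for a, b in zip(x, y))

def lanczos(A, v0, m):
    """Textbook Lanczos: basis v_0..v_m plus coefficients gamma_j, delta_j."""
    V, gamma, delta = [v0], [], []
    for j in range(m):
        w = matvec(A, V[j])
        gamma.append(dot(w, V[j]))
        w = [wi - gamma[j] * vi for wi, vi in zip(w, V[j])]
        if j > 0:
            w = [wi - delta[j - 1] * vi for wi, vi in zip(w, V[j - 1])]
        delta.append(math.sqrt(dot(w, w)))
        V.append([wi / delta[j] for wi in w])
    return V, gamma, delta

def apply_shifted_poly(A, sigma, k, v):
    """Return P_k(A) v with P_k(t) = (t - sigma_0) ... (t - sigma_{k-1})."""
    for s in sigma[:k]:
        Av = matvec(A, v)
        v = [a - s * b for a, b in zip(Av, v)]
    return v

def aux_vector(A, sigma, V, k, j):
    """Reference z^(k)_j built from its definition (needs spmvs; checking only)."""
    if j <= k:
        return apply_shifted_poly(A, sigma, j, V[0])
    return apply_shifted_poly(A, sigma, k, V[j - k])

def spmv_free_update(z_up, z_j, z_jm1, sigma_k, gamma, delta, delta_prev):
    """z^(k)_{j+1} from z^(k+1)_{j+1}, z^(k)_j and z^(k)_{j-1}; A is not used."""
    return [(a + (sigma_k - gamma) * b - delta_prev * c) / delta
            for a, b, c in zip(z_up, z_j, z_jm1)]

def max_update_error(n=10, l=3, m=5, sigma=(0.5, 2.0, 3.5)):
    """Largest entrywise mismatch between spmv-free updates and definitions."""
    A = [[2.0 if r == c else -1.0 if abs(r - c) == 1 else 0.0
          for c in range(n)] for r in range(n)]
    v0 = [float(i + 1) for i in range(n)]
    nrm = math.sqrt(dot(v0, v0))
    v0 = [vi / nrm for vi in v0]
    V, gamma, delta = lanczos(A, v0, m)
    worst = 0.0
    for k in range(l):            # bases Z^(0), ..., Z^(l-1)
        for idx in range(m):      # idx = j - k indexes the Lanczos coefficients
            j = k + idx
            z_up = aux_vector(A, sigma, V, k + 1, j + 1)
            z_j = aux_vector(A, sigma, V, k, j)
            z_jm1 = aux_vector(A, sigma, V, k, j - 1) if idx > 0 else [0.0] * n
            d_prev = delta[idx - 1] if idx > 0 else 0.0
            new = spmv_free_update(z_up, z_j, z_jm1, sigma[k],
                                   gamma[idx], delta[idx], d_prev)
            ref = aux_vector(A, sigma, V, k, j + 1)
            worst = max(worst, max(abs(a - b) for a, b in zip(new, ref)))
    return worst
```

Calling `max_update_error()` should return a value near machine precision, which is consistent with the updates being algebraically equivalent to the definitions. In the sketch the reference vectors are built with explicit spmvs only to have something to compare against; the update function itself never touches $A$ .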

Let us expound on Alg. 2 in some more detail. In the $i$ -th iteration of Alg. 2 each basis $Z^{(0)}_{j},Z^{(1)}_{j},\ldots,Z^{(l)}_{j}$ is updated by adding one vector. Thus the algorithm computes a total of $l+1$ new basis vectors, i.e.: $\{v_{i-l+1}=z^{(0)}_{i-l+1},z^{(1)}_{i-l+2},\ldots,z^{(l-1)}_{i},z^{(l)}_{i+1}=z_{i+1}\}$ , per iteration. For each basis, the corresponding vector update in iteration $i\geq l$ is computed as follows:

[TABLE]

Note that all vector updates make use of the same scalar coefficients $\gamma_{i-l}$ , $\delta_{i-l}$ and $\delta_{i-l-1}$ that are computed in iteration $i$ of Alg. 2 (lines 11-17). We also remark that only one spmv, namely $Az^{(l)}_{i}$ , is required per iteration to compute all basis vector updates, similar to Alg. 1.
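The index bookkeeping described above can be spelled out with a tiny helper. The function names are hypothetical and serve only to make the pattern $z^{(k)}_{i-l+k+1}$ ( $0\leq k\leq l$ ) and the shared coefficient indices explicit.

```python
def vectors_updated_in_iteration(i, l):
    """Return (k, index) pairs: basis Z^(k) gains the vector z^(k)_index.

    k = 0 corresponds to v_{i-l+1}, k = l to z_{i+1}.
    """
    if i < l:
        raise ValueError("the recurrences apply from iteration i >= l onward")
    return [(k, i - l + k + 1) for k in range(l + 1)]

def shared_coefficient_indices(i, l):
    """Indices of gamma_{i-l}, delta_{i-l} and delta_{i-l-1} used by all updates."""
    return {"gamma": i - l, "delta": i - l, "delta_prev": i - l - 1}

# Pipeline length l = 3 in iteration i = 5: v_3, z^(1)_4, z^(2)_5, z^(3)_6.
updates = vectors_updated_in_iteration(5, 3)
```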

The merit of the intermediate basis vector updates is that they allow the recurrence (9) for the original basis vector $v_{j+1}$ to be replaced by relation (44), which for $k=0$ yields:

[TABLE]

with $j\geq 0$ . Since relation (43) with $k=0$ states that $z^{(1)}_{j+1}+\sigma_{0}v_{j}=Av_{j}$ , expression (47) very closely resembles the finite precision Lanczos recurrence relation (38). In particular, we point out that the matrix $G_{j+1}^{-1}$ does not occur in the recursive update (47) for $v_{j+1}$ . We clarify the difference between the finite precision and exact variant of recurrence relation (47) in Section 4.4.

Example 1.

To illustrate the methodology for constructing the basis in the stable p( $l$ )-CG method, consider the case where the pipeline length $l=3$ . We formulate the improved p( $3$ )-CG method following the derivation above. This scenario features the default Krylov subspace basis $Z_{j}^{(0)}=V_{j}$ and three auxiliary bases: $Z_{j}^{(1)},Z_{j}^{(2)}$ and $Z_{j}^{(3)}=Z_{j}$ . To compute a new vector for each auxiliary basis, the following recurrence relations are used by Alg. 2 in iteration $i\geq 3$ :

[TABLE]

The final recurrence relation to update $z^{(3)}_{i+1}=z_{i+1}$ is identical to expression (6) with $l=3$ . The former recurrence relation for $z^{(0)}_{i-2}=v_{i-2}$ , expression (9), is replaced by the (stable) relation (47). This update explicitly uses the auxiliary variable $z^{(1)}_{i-2}$ and implicitly depends on the other $l-1=2$ auxiliary variables $z^{(2)}_{i-2}$ and $z^{(3)}_{i-2}$ through the respective recurrence relations. All four recurrence relations above make use of the same scalar coefficients $\delta_{i-3},\gamma_{i-3}$ and $\delta_{i-4}$ that form the last column of the matrix $T_{i-1,i-2}$ . Similar to Alg. 1 these coefficients are computed on lines 11-17 in Alg. 2, right before the recursive vector updates.

Example 2.

In the case where the pipeline length is one, i.e. $l=1$ , the stable p( $l$ )-CG Alg. 2 formally differs slightly from the original formulation of p( $1$ )-CG, Alg. 1. The improved algorithm uses the following recurrence relations for $v_{i}$ and $z_{i+1}$ in iteration $i\geq 1$ :

[TABLE]

where $z^{(0)}_{i}=v_{i}$ and $z^{(1)}_{i+1}=z_{i+1}$ . This implies that the only difference between the improved and original p( $1$ )-CG method is the recurrence relation for $v_{i}$ . The recurrence relation for $v_{i}$ above is equivalent to the recurrence relation (9) in exact arithmetic, but it is numerically (more) stable as explained in Section 4.4.

4.2 Computational costs and storage requirements

We give an overview of implementation details of the stable p( $l$ )-CG method, Alg. 2, including global storage requirements and number of flops (floating point operations) per iteration. We compare to the same properties for the former version of p( $l$ )-CG Alg. 1 [11] and Ghysels’ p-CG method [19]. The latter algorithm, although mathematically equivalent to p( $1$ )-CG, was derived in an essentially different way.

4.2.1 Floating point operations per iteration

All Conjugate Gradient variants listed in Table I compute a single spmv in each iteration. However, as indicated by the Time column, time per iteration may be reduced significantly by overlapping the global reduction phase with the computation of one or multiple spmvs. Time required by the local axpy and dot-pr computations is neglected, since these operations are assumed to scale perfectly as a function of the number of workers.

Comparing Alg. 1 to Alg. 2, it is clear that the latter requires an additional $2(l-1)$ axpys per iteration to update the auxiliary vectors $z^{(1)}_{i-l+2},\ldots,z^{(l-1)}_{i}$ using the recurrence relations (45). However, the recurrence relation to update $v_{i-l+1}$ , expression (47), only requires $2$ axpy operations, instead of the $2l$ axpys required to update $v_{i-l+1}$ in Alg. 1 using expression (9). Both algorithms furthermore compute $(l+1)$ local dot products (see Remark 2) to form the $G_{j}$ matrix and use two additional axpy operations to update the search direction $p_{i-l}$ and the iterate $x_{i-l}$ . In summary, as indicated by the Flops column in Table I, both p( $l$ )-CG algorithms use a total of $(6l+10)N$ flops in each iteration. The recurrence relations in the p( $l$ )-CG algorithm can thus be stabilized at no additional computational cost using the framework outlined in Section 4.1.
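The operation count above can be tallied in a few lines. This is a sketch under stated assumptions: each axpy and each dot product is charged $2N$ flops, and the update of $z_{i+1}$ via (6) is charged two axpys in both algorithms (the text does not state this count explicitly, but it is the value that reproduces the $(6l+10)N$ total). The function names are illustrative.

```python
FLOPS_PER_AXPY = 2  # assumption: one multiply and one add per vector entry
FLOPS_PER_DOT = 2   # assumption: one multiply and one add per vector entry

def flops_alg1(l):
    """Flops per iteration (in units of N) for the original p(l)-CG, Alg. 1."""
    axpys = 2 * l    # update of v_{i-l+1} via expression (9)
    axpys += 2       # update of z_{i+1} via expression (6) (assumed count)
    axpys += 2       # search direction p_{i-l} and iterate x_{i-l}
    dots = l + 1     # local dot products forming the G matrix
    return FLOPS_PER_AXPY * axpys + FLOPS_PER_DOT * dots

def flops_alg2(l):
    """Flops per iteration (in units of N) for the stable p(l)-CG, Alg. 2."""
    axpys = 2 * (l - 1)  # intermediate bases Z^(1), ..., Z^(l-1) via (45)
    axpys += 2           # update of v_{i-l+1} via expression (47)
    axpys += 2           # update of z_{i+1} via expression (6) (assumed count)
    axpys += 2           # search direction p_{i-l} and iterate x_{i-l}
    dots = l + 1
    return FLOPS_PER_AXPY * axpys + FLOPS_PER_DOT * dots

counts = {l: (flops_alg1(l), flops_alg2(l)) for l in range(1, 6)}
```

Under these assumptions both functions return $6l+10$ for every pipeline length: the $2(l-1)$ extra axpys for the intermediate bases are offset exactly by the cheaper update of $v_{i-l+1}$ .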

4.2.2 Global storage requirements

Section 4.4 proves that the stable p( $l$ )-CG method, Alg. 2, is extremely resilient to the presence of local rounding errors in the recurrence relations. However, this stability comes at a slightly increased storage cost compared to Alg. 1. The latter requires storing $2l+1$ vectors of the $V_{j}$ basis (required for vector updates), $l+1$ vectors of the auxiliary basis $Z_{j}$ (for vector updates and dot product computations, see also Remark 2), and the vector $p_{i-l}$ at any time during the execution of the algorithm from iteration $i\geq 2l-1$ onward. In contrast, Alg. 2 stores the three most recently updated vectors in each of the bases $Z^{(0)}_{j},\ldots,Z^{(l)}_{j}$ (which include the bases $V_{j}$ and $Z_{j}$ ). In addition, $l$ vectors in the $Z_{j}$ basis need to be stored for dot product computations. Table I summarizes the storage requirements for different variants of the CG algorithm. Alg. 1 keeps a total of $(2l+1)+(l+1)+1=3l+3$ vectors in memory at any time during execution, whereas Alg. 2 stores $4l+1$ vectors. The memory overhead for the stable p( $l$ )-CG method thus amounts to a modest $l-2$ vectors.

4.3 A stable preconditioned p( $l$ )-CG algorithm

Preconditioning is indispensable to efficiently solve linear systems in practice. We briefly comment on the straightforward extension of Alg. 2 to include a preconditioner. This section follows the standard methodology that was described for pipelined CG in [19] and for Alg. 1 in [11].

Let the preconditioner be given by the matrix $M^{-1}$ . We aim to solve the left-preconditioned linear system $M^{-1}Ax=M^{-1}b$ , where $M$ and $A$ are both symmetric positive definite matrices. This assumption does not necessarily imply that $M^{-1}A$ is symmetric. Nonetheless, symmetry can be preserved by observing that $M^{-1}A$ is self-adjoint with respect to the $M$ inner product $(x,y)_{M}=(Mx,y)=(x,My)$ . The basic strategy is thus to replace all Euclidean dot products occurring in Alg. 2 with $M$ inner products.
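The self-adjointness claim follows in one line from the definition of the $M$ inner product and the symmetry of $A$ and $M$ . For arbitrary vectors $x$ and $y$ :

$$
(M^{-1}Ax,\,y)_{M}=(MM^{-1}Ax,\,y)=(Ax,\,y)=(x,\,Ay)=(x,\,MM^{-1}Ay)=(x,\,M^{-1}Ay)_{M}.
$$

The third equality uses $A=A^{T}$ ; the first and last use the two equivalent forms $(x,y)_{M}=(Mx,y)=(x,My)$ , which hold because $M=M^{T}$ .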

We cannot simply use the matrix $M$ to calculate these $M$ inner products, since the preconditioner inverse is in general not available. However, by introducing the unpreconditioned auxiliary variables $u_{i}=Mz^{(l)}_{i}$ and observing that these variables again satisfy a Lanczos type relation:

[TABLE]

this obstacle is circumvented. Using these unpreconditioned auxiliary variables $u_{i+1}$ , the dot products $g_{j,i+1}$ for $0\leq j\leq i+1$ can be computed as follows. For $i<l$ it holds that

[TABLE]

with $j=0,\ldots,i+1$ . For $i\geq l$ we find

[TABLE]

This allows one to formulate a preconditioned version of Alg. 2 by adding the recurrence relation (48) for the unpreconditioned auxiliary variables $u_{i}$ , and replacing the dot products on lines $23$ and $25$ by expressions (49) and (50) respectively. From an implementation point of view the extension to the preconditioned algorithm only requires the application of the preconditioner $M^{-1}$ , two additional axpy operations and storage of three additional vectors in memory.

4.4 Rounding error analysis for the improved method

We now consider the finite precision equivalents of the above recurrence relations for all basis vectors in the improved version of p( $l$ )-CG, Alg. 2. The actually computed finite precision variants of the basis vectors and scalar variables are again denoted by bars.

Let the true basis vector $\bar{\mathbb{v}}_{j}$ satisfy the actual Lanczos relation without local roundoff (for $0\leq j<i-l$ )

[TABLE]

and let, analogously, the true auxiliary basis vectors be defined (for $j\geq k$ and $0\leq k\leq l$ ) as

[TABLE]

On the other hand, the finite precision variants of the recurrence relations (45), which are actually computed in Alg. 2, are for $j\geq k$ and $0\leq k<l$ given by the expressions

[TABLE]

whereas for $k=l$ the following finite precision relation holds (with $j\geq l$ ):

[TABLE]

The local rounding errors $\xi^{(k)}_{j+1}$ $(0\leq k\leq l)$ in the above expressions can be bounded by machine precision $\epsilon$ multiplied by the norms of the respective vectors, i.e. $\|\xi^{(k)}_{j+1}\|\leq\mathcal{O}(\epsilon)$ .

Switching to matrix forms to simplify the notation, we write the Lanczos relations (52) as

[TABLE]

for $j>k$ and $0\leq k\leq l$ . Note that this expression neglects the “initial” gaps, i.e. the quantities $\bar{\mathbb{z}}^{(k)}_{j}-\bar{z}^{(k)}_{j}$ for $0\leq j\leq k$ , consisting of the local rounding errors from the application of the matrix polynomials $P_{j}(A)$ to the initial vector $v_{0}$ , which are computed explicitly in the stable p( $l$ )-CG method, Alg. 2 (line 3). The recurrence relations (53) are summarized by the matrix expressions

[TABLE]

with $j>k$ and $0\leq k<l$ , where $\Xi^{(k)}_{j+1}=[\xi^{(k)}_{0},\ldots,\xi^{(k)}_{j}]$ and $\bar{Z}^{(k+1)}_{2:j+1}=[\bar{z}^{(k+1)}_{1},\bar{z}^{(k+1)}_{2},\ldots,\bar{z}^{(k+1)}_{j}]$ . The recurrence relation (54) for $\bar{z}^{(l)}_{j+1}$ can be written as

[TABLE]

with $j>l$ . To find an expression for the gap on the basis vectors in $\bar{V}_{j}$ , i.e. the gap $\bar{\mathbb{Z}}^{(0)}_{j}-\bar{Z}^{(0)}_{j}$ , we now progressively compute the gaps on the auxiliary bases. Starting from $\bar{\mathbb{Z}}^{(l)}_{j}-\bar{Z}^{(l)}_{j}$ , we compute $\bar{\mathbb{Z}}^{(l-1)}_{j}-\bar{Z}^{(l-1)}_{j}$ , in which we substitute the gap on $\bar{Z}^{(l)}_{j}$ , followed by $\bar{\mathbb{Z}}^{(l-2)}_{j}-\bar{Z}^{(l-2)}_{j}$ , etc., until we eventually obtain an expression for the gap $\bar{\mathbb{Z}}^{(0)}_{j}-\bar{Z}^{(0)}_{j}$ . For $k=l$ we derive from (55) and (57) that

[TABLE]

As indicated by this expression, the gaps on the basis vectors in the basis $\bar{Z}^{(l)}_{j}$ thus consist of local rounding errors only. The following auxiliary lemma can easily be proven by induction.

Lemma 1.

Let $\bar{Z}^{(k)}_{j}$ be defined by (56) and $\bar{\mathbb{Z}}^{(k+1)}_{j+1}$ be defined by (55). Then it holds for $j>k$ and $0\leq k<l$ that

[TABLE]

where $\Theta^{(k+1)}_{2:j+1}=[\theta^{(k+1)}_{1},\theta^{(k+1)}_{2},\ldots,\theta^{(k+1)}_{j}]$ with $\theta^{(k+1)}_{j}$ a local rounding error that is bounded by $\mathcal{O}(\epsilon)$ and $\bar{\mathbb{Z}}^{(k+1)}_{2:j+1}=[\bar{\mathbb{z}}^{(k+1)}_{1},\bar{\mathbb{z}}^{(k+1)}_{2},\ldots,\bar{\mathbb{z}}^{(k+1)}_{j}]$ .

Next, we combine expressions (55) and (56) and Lemma 1 for the case $k=l-1$ . We obtain (with $j>l-1$ )

[TABLE]

Hence, the gaps on the basis vectors $\bar{Z}^{(l-1)}_{j}$ are coupled to the gaps on the basis vectors $\bar{Z}^{(l)}_{j}$ . After substitution of expression (58) it is clear that $\bar{\mathbb{Z}}^{(l-1)}_{j+1}-\bar{Z}^{(l-1)}_{j+1}$ consists only of local rounding errors. The above relation can be generalized for any $k\in\{0,1,\ldots,l-1\}$ as follows:

[TABLE]

where $j>k$ . After subsequent substitution of this expression starting from $\bar{\mathbb{Z}}^{(0)}_{j+1}-\bar{Z}^{(0)}_{j+1}$ up until the characterization of $\bar{\mathbb{Z}}^{(l)}_{j+1}-\bar{Z}^{(l)}_{j+1}$ , see (58), it appears that the gap $\bar{\mathbb{Z}}^{(0)}_{j+1}-\bar{Z}^{(0)}_{j+1}=\bar{\mathbb{V}}_{j+1}-\bar{V}_{j+1}$ on the original Krylov subspace basis is just a sum of local rounding errors. No rounding error propagation takes place in the stable p( $l$ )-CG method, Alg. 2, on any of the (auxiliary) basis vectors, see expressions (58)-(4.4). By introducing the intermediate auxiliary bases $\bar{Z}^{(1)}_{j},\ldots,\bar{Z}^{(l-1)}_{j}$ for the recursive computation of the basis $\bar{V}_{j}$ , the dependency of the basis vector gaps on the possibly ill-conditioned matrix $\bar{G}^{-1}_{j}$ , cf. expressions (33) and (37), is thus removed, resulting in a numerically stable algorithm.

5 Numerical results

We present various numerical experiments to benchmark the stable p( $l$ )-CG method proposed in Section 4. The benchmark problems exemplify both the performance of the improved p( $l$ )-CG method on large scale parallel hardware as well as its error resilience compared to other CG variants. Performance measurements result from a PETSc [1] implementation of the p( $l$ )-CG algorithm on a distributed memory machine using the message passing paradigm (MPI).

5.1 Hardware and software specifications

Parallel performance experiments are performed on up to 128 compute nodes of a cluster, where each node consists of two 14-core Intel E5-2680v4 Broadwell generation CPUs (28 cores per node). Nodes are connected through an EDR InfiniBand network. We use PETSc version 3.8.3 [1]. The MPI library used for this experiment is Intel MPI 2018v3. The environment variables MPICH_ASYNC_PROGRESS=1 and MPICH_MAX_THREAD_SAFETY=multiple ensure optimal parallelism by allowing for asynchronous non-blocking global communication. Timing results reported in this manuscript are the most favorable results (smallest overall run-time) over 3 individual runs of each method. Experiments also show results for Ghysels’ p-CG method [19] as a reference. The p-CG method is similar to p( $1$ )-CG in operational cost (see Table I), but features a significant loss of attainable accuracy due to rounding error propagation in its recurrence relations [10], similar to p( $l$ )-CG, Alg. 1.

5.2 Benchmark (B1): 2D Laplace PDE

Fig. 2 is the analogue of the experiment reported in Fig. 1 where the p( $l$ )-CG Alg. 1 is replaced by the improved Alg. 2. The model problem solved is the 2D Laplace equation

[TABLE]

with homogeneous Dirichlet boundary conditions, discretized using second order finite differences on a uniformly spaced $100\times 100$ grid. In contrast to Fig. 1(a), no loss of accuracy due to local rounding error amplification in the recurrence relations is observed in Fig. 2(a). The stabilized recurrences (45)-(46) ensure that the quantity $\|(A\bar{V}_{i}-\bar{V}_{i+1}\bar{T}_{i+1,i})\bar{U}_{i}^{-1}\bar{q}_{i}\|$ is of order machine precision. The true residual norms (full lines) and recursively computed residual norms (dotted lines) coincide up to $\|b-A\bar{x}_{i}\|_{2}/\|b\|_{2}=1.0$ e- $12$ (maximal attainable accuracy) for all methods in Fig. 2(a). Fig. 2(b) quantifies the inexact orthogonality for the different methods which is comparable to Fig. 1(b). Hence, similarly to Fig. 1(a) a delay of convergence may be observed in Fig. 2(a).

Fig. 3 shows a performance experiment on the hardware and software setup specified above. A linear system resulting from discretization of the 2D Laplace equation (61) with exact solution $\hat{x}_{j}=1$ , right-hand side $b=A\hat{x}$ and initial guess $x_{0}=0$ is solved. This problem is available as example $2$ in the PETSc Krylov subspace solvers (KSP) folder. The simulation domain is discretized using a $1,750\times 1,750$ uniform finite difference mesh ( $3,062,500$ unknowns). No preconditioner is applied. For p( $l$ )-CG Chebyshev shifts $\{\sigma_{0},\ldots,\sigma_{l-1}\}$ are used based on the interval $[\lambda_{\min},\lambda_{\max}]=[0,8]$ (known analytically), see (24).
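The problem size and spectral interval quoted above can be checked with a short sketch. Two assumptions are made here: the matrix is the unscaled 5-point stencil (diagonal $4$ , off-diagonals $-1$ ), whose eigenvalues on an $n\times n$ grid are $4-2\cos(p\pi/(n+1))-2\cos(q\pi/(n+1))$ , and the shifts are taken as the roots of the degree- $l$ Chebyshev polynomial mapped to $[\lambda_{\min},\lambda_{\max}]$ . Expression (24) is not reproduced in this section, so the shift formula below is the standard choice rather than a quotation; all function names are illustrative.

```python
import math

def chebyshev_shifts(lam_min, lam_max, l):
    """Roots of the degree-l Chebyshev polynomial mapped to [lam_min, lam_max]."""
    center = 0.5 * (lam_max + lam_min)
    radius = 0.5 * (lam_max - lam_min)
    return [center + radius * math.cos((2 * i + 1) * math.pi / (2 * l))
            for i in range(l)]

def laplace_2d_extreme_eigenvalues(n):
    """Smallest and largest eigenvalue of the unscaled 5-point stencil matrix."""
    c = math.cos(math.pi / (n + 1))
    return 4.0 - 4.0 * c, 4.0 + 4.0 * c

n = 1750
unknowns = n * n
lam_lo, lam_hi = laplace_2d_extreme_eigenvalues(n)
shifts = {l: chebyshev_shifts(0.0, 8.0, l) for l in (1, 2, 3)}
```

With these assumptions the spectrum lies strictly inside $(0,8)$ , the single shift for $l=1$ sits at the midpoint $4$ , and the shifts for longer pipelines are placed symmetrically around it.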

Fig. 3(a) shows the speedup achieved by different CG methods over single-node classic CG for various pipeline lengths and node setups. Classic CG scales poorly for this model problem; no speedup is achieved beyond $8$ nodes. The pipelined methods scale well. The length one p-CG and p( $1$ )-CG methods achieve a relative speedup of approximately $5\times$ compared to classic CG when both are executed on $16$ nodes. The longer pipelined p( $2$ )-CG and p( $3$ )-CG methods out-scale the latter method, with p( $2$ )-CG achieving a $7\times$ speedup relative to classic CG on $32$ nodes. When $l=1$ the global communication phase in each iteration is only partially ‘hidden’ behind the spmv computation, whereas overlapping with more than two spmvs by using pipeline lengths $l\geq 3$ does not seem to improve performance further. Pipeline length $l=2$ is optimal for this problem, striking a good balance between overlapping communication and introducing additional auxiliary vectors.

Fig. 3(b) plots the relative residual norms as a function of the total time spent (in s.) by various CG algorithms (on 500 iteration intervals). The p-CG method stagnates around $\|b-A\bar{x}_{i}\|_{2}/\|b\|_{2}=1.0$ e- $8$ and is unable to attain a better accuracy regardless of computational effort. The stable p( $l$ )-CG methods all are able to attain a much higher accuracy, stagnating around $\|b-A\bar{x}_{i}\|_{2}/\|b\|_{2}=2.7$ e- $13$ for $l\in\{1,2,3,4,5\}$ . Maximal attainable accuracy is reached in 2.01 s. for p( $1$ )-CG, 1.28 s. for p( $2$ )-CG, 1.34 s. for p( $3$ )-CG, 1.74 s. for p( $4$ )-CG, 2.73 s. for p( $5$ )-CG. Note that classic CG attains $\|b-A\bar{x}_{i}\|_{2}/\|b\|_{2}=2.5$ e- $13$ in $13.9$ s. (outside the graph). These results are consistent with Fig. 3(a), indicating that a pipeline length of $l=2$ suffices to hide the global communication phase for this problem.
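The reported timings can be tabulated to make the comparison explicit. The numbers below are copied from the paragraph above; the ratio to classic CG is derived arithmetic only and should be read with care, since the methods stagnate at slightly different accuracies ( $2.7$ e- $13$ versus $2.5$ e- $13$ ). Variable names are illustrative.

```python
# Time (in seconds) to reach maximal attainable accuracy, as reported in the text.
time_to_max_accuracy = {1: 2.01, 2: 1.28, 3: 1.34, 4: 1.74, 5: 2.73}
classic_cg_time = 13.9

# Pipeline length with the smallest reported time.
best_l = min(time_to_max_accuracy, key=time_to_max_accuracy.get)

# Ratio of the classic CG time to each stable p(l)-CG time.
ratio_to_classic = {l: classic_cg_time / t
                    for l, t in time_to_max_accuracy.items()}

for l in sorted(time_to_max_accuracy):
    print(f"p({l})-CG: {time_to_max_accuracy[l]:.2f} s, "
          f"classic CG takes {ratio_to_classic[l]:.1f}x as long")
```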

5.3 Benchmark (B2): 3D Hydrostatic Ice Sheet Flow

Fig. 4(a)-(b) show the results of two strong scaling experiments for the 3D Hydrostatic Ice Sheet Flow problem, see [2] for a full problem description. An implementation of this problem is available as example $48$ in the PETSc Scalable Nonlinear Equations Solvers (SNES) folder. The Blatter/Pattyn equations are discretized using $100\times 100\times 20$ (Fig. 4(a)) and $150\times 150\times 100$ (Fig. 4(b)) finite elements respectively. Hardware and software specifications are presented in Section 5.1. A Newton-Krylov outer-inner iteration is used to solve the non-linear problem. The CG methods used as inner solver are combined with a block Jacobi preconditioner (one block Jacobi step per CG iteration; one block per processor). [Footnote 3: To attain good parallel scaling using pipelined Krylov methods it is generally beneficial to choose a preconditioner which does not require global communication. We hence use (block) stationary iterative methods as preconditioners that only use spmvs and axpys but avoid computing dot products in both examples (B2) and (B3).] The relative tolerance of the inner solvers is set to $\|b-A\bar{x}_{i}\|_{2}/\|b\|_{2}=1.0$ e- $10$ (Fig. 4(a)) and $\|b-A\bar{x}_{i}\|_{2}/\|b\|_{2}=1.0$ e- $5$ (Fig. 4(b)), while the outer relative tolerance is chosen to be $1.0$ e- $8$ . Seven Newton iterations are needed to reach the outer tolerance. Chebyshev shifts are based on the interval $[\lambda_{\min},\lambda_{\max}]=[0,2]$ , which is chosen in an ad hoc fashion for this problem based on the presumed clustering of the spectrum around $1$ .

In Fig. 4(b) the scalability of p( $l$ )-CG for $l\in\{1,2,3\}$ is comparable to that of the p-CG method. On $128$ nodes the pipelined methods achieve a speedup of approximately $10\times$ compared to classic CG on the same number of nodes. There is no gain in using longer pipeline lengths for this problem and hardware setup, indicating that the amount of computational work of a single iteration (spmv + prec) suffices to hide the communication in the global reduction in each iteration. It is expected that longer pipeline lengths out-scale the methods with shorter pipelines on very large numbers of nodes, since communication costs would increase accordingly. This is illustrated to some extent by the smaller problem reported in Fig. 4(a), which shows that for heavily communication bound problems the use of longer pipelines is beneficial for this benchmark. Fig. 4(a) indicates that depending on the amount of hardware parallelism (number of nodes) p( $1$ )-CG (4-24 nodes), p( $2$ )-CG (32-40 nodes) or p( $3$ )-CG (48+ nodes) respectively display the biggest speedup.

An accuracy experiment for the $150\times 150\times 100$ FE benchmark problem is shown in Fig. 4(c). The experiment is run on $8$ nodes and the relative tolerance of the inner solver is $\|b-A\bar{x}_{i}\|_{2}/\|b\|_{2}=1.0$ e- $10$ . A relatively small delay of convergence is observed for the p( $l$ )-CG method, with the effect worsening for longer pipelines. The convergence delay is due to restarts (caused by a square root breakdown) which stem from the ill-conditioned auxiliary basis $\bar{Z}_{j}$ . The effect is negligible since the maximal total delay after seven Newton iterations is $174$ iterations (comparing p( $3$ )-CG to CG), relative to a total of $\sim 4,500$ Krylov iterations. The p-CG method fails to reach the inner tolerance $1.0$ e- $10$ . As shown by the analysis in Section 4.4, no rounding error propagation occurs in the stable p( $l$ )-CG algorithm and the method is able to attain a highly accurate solution.

5.4 Benchmark (B3): 2D Bratu Solid Fuel Ignition

The 2D Bratu Solid Fuel Ignition problem results from a finite difference discretization of the nonlinear PDE

$$-\Delta u-\lambda_{B}e^{u}=0,\qquad 0<x,y<1,$$

with homogeneous Dirichlet boundary conditions on a uniformly spaced 2D grid. Its implementation can be found in the PETSc SNES folder as example 5. The two-dimensional domain is discretized with $1,000\times 1,000$ uniformly spaced grid points. The Bratu parameter $\lambda_{B}$ is set to $6.0$, implying significant non-linearity. Different CG methods are used as the inner solver in the Newton-Krylov scheme. The preconditioner is a SOR-preconditioned Chebyshev iteration (three SOR steps per Chebyshev iteration; one Chebyshev step per CG iteration). The relative tolerance of the outer solver is set to $1.0$e-$8$, which is attained by all methods after three Newton steps. For p($l$)-CG, Chebyshev shifts based on the interval $[\lambda_{\min},\lambda_{\max}]=[0,1.2]$ are used. A relative tolerance of $\|b-A\bar{x}_{i}\|_{2}/\|b\|_{2}=1.0$e-$10$ is imposed on the inner solvers.
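To make the discretization concrete, the following sketch evaluates the nonlinear residual of the Bratu problem with a standard five-point finite difference stencil. It assumes the equation takes the form $-\Delta u-\lambda_{B}e^{u}=0$ on the unit square, as in PETSc SNES example 5, and it is an independent illustration rather than the PETSc code: the function name `bratu_residual`, the list-of-rows storage and the unscaled stencil are assumptions made for brevity.

```python
import math


def bratu_residual(u, lam=6.0):
    """Residual of -Laplace(u) - lam * exp(u) = 0 at the interior grid points.

    `u` holds the interior values as a list of n rows of length n on the unit
    square; the homogeneous Dirichlet boundary values are implied to be zero.
    """
    n = len(u)
    h = 1.0 / (n + 1)  # uniform grid spacing

    def at(i, j):
        # Values outside the interior lie on the boundary, where u = 0.
        return u[i][j] if 0 <= i < n and 0 <= j < n else 0.0

    res = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            # Five-point approximation of -Laplace(u) at point (i, j).
            neg_laplace = (
                4.0 * at(i, j)
                - at(i - 1, j) - at(i + 1, j)
                - at(i, j - 1) - at(i, j + 1)
            ) / (h * h)
            res[i][j] = neg_laplace - lam * math.exp(at(i, j))
    return res


if __name__ == "__main__":
    zero_guess = [[0.0] * 3 for _ in range(3)]
    print(bratu_residual(zero_guess))
```

For the zero initial guess the discrete Laplacian vanishes and $e^{0}=1$, so every residual entry equals $-\lambda_{B}$. Each Newton step then solves a linear system with the Jacobian of this residual, which is where the different CG variants enter as inner solvers.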

Fig. 5(a) presents the relative speedup of the p($l$)-CG methods for $l\in\{1,2,3\}$ compared to classic CG on one node. Timing data for the p-CG method is not included in the figure since it is unable to achieve the desired accuracy; the maximal accuracy attainable by p-CG is $\|b-A\bar{x}_{i}\|_{2}/\|b\|_{2}=1.3$e-$9$, $4.2$e-$8$ and $6.4$e-$8$ for outer SNES iterations $1$, $2$ and $3$ respectively. On $32$ nodes the speedup achieved by the pipelined methods over classic CG is approximately $4\times$. For this small problem the p($l$)-CG method does not scale beyond $32$ nodes, since CPU times are dominated by the $\mathcal{O}(\log_{2}(\#\text{nodes}))$ behavior of the global reduction phases in this regime. Longer pipelines do not yield a higher speedup than pipeline length $1$ due to the cost of the preconditioner, which requires two spmvs in each iteration.

Fig. 5(b) shows the relative residual norm as a function of the total time spent on solving the three Newton iterations on a $24$-node setup. The timings and residual norms are taken from the same experiment as Fig. 5(a). For each CG method the total time-to-solution is divided by its number of iterations, so that the methods can be compared on a single graph. The p($l$)-CG methods outperform the classic CG method despite a small increase in the total number of iterations for longer pipeline lengths. The stable p($l$)-CG algorithm reaches the relative tolerance $1.0$e-$10$ for any choice of the parameter $l$.

6 Conclusions

This work presents a redesigned algorithmic variant of the $l$-length pipelined Conjugate Gradient method, p($l$)-CG for short. The main improvement over former pipelined CG variants is the significantly higher maximal attainable accuracy of the new algorithm. More specifically, it is shown analytically and verified experimentally that the stable p($l$)-CG method attains the same precision on the solution that is attainable by the classic CG method. By introducing intermediate auxiliary bases the propagation of local rounding errors in the recurrence relations is eliminated, allowing for a high-precision solution independent of the choice of the pipeline length $l$. The new p($l$)-CG algorithm is elegant in the sense that it has no additional computational overhead and only minor additional storage requirements compared to previous versions of the p($l$)-CG algorithm. The increased stability thus comes without the cost of increased complexity. The improved algorithm effectively replaces former (less stable) pipelined CG variants. As such, this work resolves one of the major numerical issues that has restricted the practical usability of pipelined Krylov subspace methods since their initial development.

Generalizing the stabilization technique proposed in this work to other pipelined methods is non-trivial. A similar methodology could be applied to the existing pipelined GMRES method, p($l$)-GMRES [18]. However, practical restrictions limit the viability of this approach. Notably, the full storage of an additional $(l-1)$ auxiliary bases would be required, which can be assumed to be infeasible for large-scale applications. An $l$-length variant of the pipelined BiCGStab method [9] is currently not available, but the proposed technique could be promising for direct application in the context of Bi-Conjugate Gradient methods.

The research presented in this manuscript provides vital advancements towards establishing a numerically stable communication-hiding variant of the Conjugate Gradient method. However, it is well-known that Krylov subspace methods may also suffer from delay of convergence due to loss of basis orthogonality. Our experiments indicate that this effect typically becomes more pronounced as the pipeline length increases. Analyzing the stability issues related to loss of orthogonality deserves to be treated as part of future work.

Acknowledgments

The authors are grateful for the funding received that supported this work. In particular, J. C. acknowledges funding by the University of Antwerp Research Council under the University Research Fund (BOF) and S. C. is funded by the Flemish Research Foundation (FWO Flanders) under grant 12H4617N. We also cordially thank Pieter Ghysels for providing useful comments on an earlier version of this manuscript.

Bibliography

[1] S. Balay, S. Abhyankar, M.F. Adams, J. Brown, P. Brune, K. Buschelman, L. Dalcin, V. Eijkhout, W.D. Gropp, D. Kaushik, M.G. Knepley, L. Curfman McInnes, K. Rupp, B.F. Smith, S. Zampini, and H. Zhang. PETSc Web page. www.mcs.anl.gov/petsc, 2015.
[2] J. Brown, B. Smith, and A. Ahmadia. Achieving textbook multigrid efficiency for hydrostatic ice sheet flow. SIAM Journal on Scientific Computing, 35(2):B359–B375, 2013.
[3] E. Carson and J. Demmel. A residual replacement strategy for improving the maximum attainable accuracy of s-step Krylov subspace methods. SIAM Journal on Matrix Analysis and Applications, 35(1):22–43, 2014.
[4] E. Carson, N. Knight, and J. Demmel. Avoiding communication in nonsymmetric Lanczos-based Krylov subspace methods. SIAM Journal on Scientific Computing, 35(5):S42–S61, 2013.
[5] E. Carson, M. Rozložník, Z. Strakoš, P. Tichý, and M. Tůma. The numerical stability analysis of pipelined Conjugate Gradient methods: Historical context and methodology. SIAM Journal on Scientific Computing, 40(5):A3549–A3580, 2018.
[6] A.T. Chronopoulos and C.W. Gear. s-Step iterative methods for symmetric linear systems. Journal of Computational and Applied Mathematics, 25(2):153–168, 1989.
[7] A.T. Chronopoulos and A.B. Kucherov. Block s-step Krylov iterative methods. Numerical Linear Algebra with Applications, 17(1):3–15, 2010.
[8] S. Cools. Analyzing and improving maximal attainable accuracy in the communication hiding pipelined BiCGStab method. Parallel Computing, PMAA'18 Special Issue, accepted for publication, in press, 2019.
