On the Convergence Rate of Variants of the Conjugate Gradient Algorithm   in Finite Precision Arithmetic

Anne Greenbaum; Hexuan Liu; Tyler Chen

arXiv:1905.05874·cs.NA·July 19, 2021

On the Convergence Rate of Variants of the Conjugate Gradient Algorithm in Finite Precision Arithmetic

Anne Greenbaum, Hexuan Liu, Tyler Chen

PDF

TL;DR

This paper investigates how three variants of the conjugate gradient algorithm perform in finite precision arithmetic, revealing that their convergence behavior varies depending on eigenvalue distribution and rounding error sensitivity.

Contribution

It analyzes the finite precision convergence of three mathematically equivalent CG variants, linking their performance to eigenvalue interval widths and rounding error conditions.

Findings

01

Different CG variants behave differently depending on eigenvalue distribution.

02

Convergence in finite precision correlates with how well conditions for exact CG are satisfied.

03

Ultimate accuracy levels vary among the variants based on problem sensitivity to rounding errors.

Abstract

We consider three mathematically equivalent variants of the conjugate gradient (CG) algorithm and how they perform in finite precision arithmetic. It was shown in [{\em Behavior of slightly perturbed Lanczos and conjugate-gradient recurrences}, Lin.~Alg.~Appl., 113 (1989), pp.~7-63] that under certain conditions the convergence of a slightly perturbed CG computation is like that of exact CG for a matrix with many eigenvalues distributed throughout tiny intervals about the eigenvalues of the given matrix, the size of the intervals being determined by how closely these conditions are satisfied. We determine to what extent each of these variants satisfies the desired conditions, using a set of test problems and show that there is significant correlation between how well these conditions are satisfied and how well the finite precision computation converges before reaching its ultimately…

Equations110

ℓ \sum ⟨ \overset{q}{^}_{1}, v_{i ℓ} ⟩^{2} \approx ⟨ q_{1}, v_{i} ⟩^{2} .

ℓ \sum ⟨ \overset{q}{^}_{1}, v_{i ℓ} ⟩^{2} \approx ⟨ q_{1}, v_{i} ⟩^{2} .

A Q_{J} = Q_{J} T_{J} + β_{J} q_{J + 1} ξ_{J}^{T} + F_{J},

A Q_{J} = Q_{J} T_{J} + β_{J} q_{J + 1} ξ_{J}^{T} + F_{J},

T_{J}=\left[\begin{array}[]{ccccc}\alpha_{1}&\beta_{1}&&&\\ \beta_{1}&\ddots&\ddots&&\\ &\ddots&\ddots&\beta_{J-1}&\\ &&\beta_{J-1}&\alpha_{J}\end{array}\right],

T_{J}=\left[\begin{array}[]{ccccc}\alpha_{1}&\beta_{1}&&&\\ \beta_{1}&\ddots&\ddots&&\\ &\ddots&\ddots&\beta_{J-1}&\\ &&\beta_{J-1}&\alpha_{J}\end{array}\right],

∥ f_{j} ∥ \leq ϵ ∥ A ∥,

∥ f_{j} ∥ \leq ϵ ∥ A ∥,

∣ ⟨ β_{j} q_{j + 1}, q_{j} ⟩ ∣ \leq ϵ ∥ A ∥.

∣ ⟨ β_{j} q_{j + 1}, q_{j} ⟩ ∣ \leq ϵ ∥ A ∥.

x_{J} = x_{0} + Q_{J} T_{J}^{- 1} β ξ_{1} .

x_{J} = x_{0} + Q_{J} T_{J}^{- 1} β ξ_{1} .

r_{k}

r_{k}

q_{k + 1} = a_{k - 1} \frac{∥ r _{k - 1} ∥}{∥ r _{k} ∥} A q_{k} - (1 + \frac{a _{k - 1} b _{k - 1}}{a _{k - 2}}) \frac{∥ r _{k - 1} ∥}{∥ r _{k} ∥} q_{k} - \frac{a _{k - 1} b _{k - 1}}{a _{k - 2}} \frac{∥ r _{k - 2} ∥}{∥ r _{k} ∥} q_{k - 1},

q_{k + 1} = a_{k - 1} \frac{∥ r _{k - 1} ∥}{∥ r _{k} ∥} A q_{k} - (1 + \frac{a _{k - 1} b _{k - 1}}{a _{k - 2}}) \frac{∥ r _{k - 1} ∥}{∥ r _{k} ∥} q_{k} - \frac{a _{k - 1} b _{k - 1}}{a _{k - 2}} \frac{∥ r _{k - 2} ∥}{∥ r _{k} ∥} q_{k - 1},

A q_{k} = \frac{∥ r _{k} ∥}{a _{k - 1} ∥ r _{k - 1} ∥} q_{k + 1} + (\frac{1}{a _{k - 1}} + \frac{b _{k - 1}}{a _{k - 2}}) q_{k} + \frac{b _{k - 1}}{a _{k - 2}} \frac{∥ r _{k - 2} ∥}{∥ r _{k - 1} ∥} q_{k - 1} .

A q_{k} = \frac{∥ r _{k} ∥}{a _{k - 1} ∥ r _{k - 1} ∥} q_{k + 1} + (\frac{1}{a _{k - 1}} + \frac{b _{k - 1}}{a _{k - 2}}) q_{k} + \frac{b _{k - 1}}{a _{k - 2}} \frac{∥ r _{k - 2} ∥}{∥ r _{k - 1} ∥} q_{k - 1} .

A q_{k} = \frac{∥ r _{k} ∥}{a _{k - 1} ∥ r _{k - 1} ∥} q_{k + 1} + (\frac{1}{a _{k - 1}} + \frac{b _{k - 1}}{a _{k - 2}}) q_{k} + \frac{∥ r _{k - 1} ∥}{a _{k - 2} ∥ r _{k - 2} ∥} q_{k - 1} .

A q_{k} = \frac{∥ r _{k} ∥}{a _{k - 1} ∥ r _{k - 1} ∥} q_{k + 1} + (\frac{1}{a _{k - 1}} + \frac{b _{k - 1}}{a _{k - 2}}) q_{k} + \frac{∥ r _{k - 1} ∥}{a _{k - 2} ∥ r _{k - 2} ∥} q_{k - 1} .

A Q_{J} = Q_{J} T_{J} + β_{J} q_{J + 1} ξ_{J}^{T},

A Q_{J} = Q_{J} T_{J} + β_{J} q_{J + 1} ξ_{J}^{T},

\frac{∥ r _{k} ∥}{∥ r _{0} ∥} \leq κ^{1/2} 2 (\frac{κ - 1}{κ + 1})^{k},

\frac{∥ r _{k} ∥}{∥ r _{0} ∥} \leq κ^{1/2} 2 (\frac{κ - 1}{κ + 1})^{k},

\frac{⟨ r _{k} , A ^{- 1} r _{k} ⟩ ^{1/2}}{⟨ r _{0} , A ^{- 1} r _{0} ⟩ ^{1/2}} \sim < 2 (\frac{κ - 1}{κ + 1})^{k},

\frac{⟨ r _{k} , A ^{- 1} r _{k} ⟩ ^{1/2}}{⟨ r _{0} , A ^{- 1} r _{0} ⟩ ^{1/2}} \sim < 2 (\frac{κ - 1}{κ + 1})^{k},

{p_{k} : p_{k} (0) = 1} min z \in \cup_{i = 1}^{n} [λ_{i} - δ, λ_{i} + δ] max ∣ p_{k} (z) ∣

{p_{k} : p_{k} (0) = 1} min z \in \cup_{i = 1}^{n} [λ_{i} - δ, λ_{i} + δ] max ∣ p_{k} (z) ∣

ℓ \sum ⟨ \overset{r}{^}_{0}, v_{i_{ℓ}} ⟩^{2} \approx ⟨ r_{0}, v_{i} ⟩^{2}, i = 1, \dots, n,

ℓ \sum ⟨ \overset{r}{^}_{0}, v_{i_{ℓ}} ⟩^{2} \approx ⟨ r_{0}, v_{i} ⟩^{2}, i = 1, \dots, n,

r_{k}

r_{k}

p_{k}

∥ α v - \mbox f l [α v] ∥

∥ α v - \mbox f l [α v] ∥

∥ v + w - \mbox f l [v + w] ∥

∥ A v - \mbox f l [A v] ∥

∥ δ_{r_{k}} ∥

∥ δ_{r_{k}} ∥

∥ δ_{p_{k}} ∥

a_{k - 1} = \frac{⟨ r _{k - 1} , r _{k - 1} ⟩}{⟨ p _{k - 1} , A p _{k - 1} ⟩}, b_{k} = \frac{⟨ r _{k} , r _{k} ⟩}{⟨ r _{k - 1} , r _{k - 1} ⟩},

a_{k - 1} = \frac{⟨ r _{k - 1} , r _{k - 1} ⟩}{⟨ p _{k - 1} , A p _{k - 1} ⟩}, b_{k} = \frac{⟨ r _{k} , r _{k} ⟩}{⟨ r _{k - 1} , r _{k - 1} ⟩},

r_{k}

r_{k}

γ_{k} = \frac{a _{k - 1} b _{k - 1}}{a _{k - 2}} δ_{r_{k - 1}} + a_{k - 1} A δ_{p_{k - 1}} - δ_{r_{k}} .

γ_{k} = \frac{a _{k - 1} b _{k - 1}}{a _{k - 2}} δ_{r_{k - 1}} + a_{k - 1} A δ_{p_{k - 1}} - δ_{r_{k}} .

q_{k + 1} = a_{k - 1} \frac{∥ r _{k - 1} ∥}{∥ r _{k} ∥} A q_{k} - (1 + \frac{a _{k - 1} b _{k - 1}}{a _{k - 2}}) \frac{∥ r _{k - 1} ∥}{∥ r _{k} ∥} q_{k} - \frac{a _{k - 1} b _{k - 1}}{a _{k - 2}} \frac{∥ r _{k - 2} ∥}{∥ r _{k} ∥} q_{k - 1} - \frac{( - 1 ) ^{k}}{∥ r _{k} ∥} γ_{k},

q_{k + 1} = a_{k - 1} \frac{∥ r _{k - 1} ∥}{∥ r _{k} ∥} A q_{k} - (1 + \frac{a _{k - 1} b _{k - 1}}{a _{k - 2}}) \frac{∥ r _{k - 1} ∥}{∥ r _{k} ∥} q_{k} - \frac{a _{k - 1} b _{k - 1}}{a _{k - 2}} \frac{∥ r _{k - 2} ∥}{∥ r _{k} ∥} q_{k - 1} - \frac{( - 1 ) ^{k}}{∥ r _{k} ∥} γ_{k},

A q_{k} = \frac{∥ r _{k} ∥}{a _{k - 1} ∥ r _{k - 1} ∥} q_{k + 1} + (\frac{1}{a _{k - 1}} + \frac{b _{k - 1}}{a _{k - 2}}) q_{k} + \frac{b _{k - 1}}{a _{k - 2}} \frac{∥ r _{k - 2} ∥}{∥ r _{k - 1} ∥} q_{k - 1} + \frac{( - 1 ) ^{k}}{a _{k - 1} ∥ r _{k - 1} ∥} γ_{k} .

A q_{k} = \frac{∥ r _{k} ∥}{a _{k - 1} ∥ r _{k - 1} ∥} q_{k + 1} + (\frac{1}{a _{k - 1}} + \frac{b _{k - 1}}{a _{k - 2}}) q_{k} + \frac{b _{k - 1}}{a _{k - 2}} \frac{∥ r _{k - 2} ∥}{∥ r _{k - 1} ∥} q_{k - 1} + \frac{( - 1 ) ^{k}}{a _{k - 1} ∥ r _{k - 1} ∥} γ_{k} .

A q_{k} = \frac{∥ r _{k} ∥}{a _{k - 1} ∥ r _{k - 1} ∥} q_{k + 1} + (\frac{1}{a _{k - 1}} + \frac{b _{k - 1}}{a _{k - 2}}) q_{k} + \frac{∥ r _{k - 1} ∥}{a _{k - 2} ∥ r _{k - 2} ∥} q_{k - 1} + \frac{( - 1 ) ^{k}}{a _{k - 1} ∥ r _{k - 1} ∥} γ_{k} .

A q_{k} = \frac{∥ r _{k} ∥}{a _{k - 1} ∥ r _{k - 1} ∥} q_{k + 1} + (\frac{1}{a _{k - 1}} + \frac{b _{k - 1}}{a _{k - 2}}) q_{k} + \frac{∥ r _{k - 1} ∥}{a _{k - 2} ∥ r _{k - 2} ∥} q_{k - 1} + \frac{( - 1 ) ^{k}}{a _{k - 1} ∥ r _{k - 1} ∥} γ_{k} .

\frac{( - 1 ) ^{k}}{a _{k - 1} ∥ r _{k - 1} ∥} γ_{k} = \frac{( - 1 ) ^{k}}{∥ r _{k - 1} ∥} (\frac{b _{k - 1}}{a _{k - 2}} δ_{r_{k - 1}} + A δ_{p_{k - 1}} - \frac{1}{a _{k - 1}} δ_{r_{k}}) .

\frac{( - 1 ) ^{k}}{a _{k - 1} ∥ r _{k - 1} ∥} γ_{k} = \frac{( - 1 ) ^{k}}{∥ r _{k - 1} ∥} (\frac{b _{k - 1}}{a _{k - 2}} δ_{r_{k - 1}} + A δ_{p_{k - 1}} - \frac{1}{a _{k - 1}} δ_{r_{k}}) .

r_{k}

r_{k}

p_{k}

s_{k}

∥ δ_{r_{k}} ∥

∥ δ_{r_{k}} ∥

Peer Reviews

No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.

Videos

No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.

Full text

On the Convergence Rate of Variants of the Conjugate Gradient Algorithm

in Finite Precision Arithmetic

Anne Greenbaum

Hexuan Liu

Tyler Chen University of Washington, Applied Mathematics Dept., Box 353925, Seattle, WA 98195. This work was supported in part by NSF grant DMS-1210886.

Abstract

We consider three mathematically equivalent variants of the conjugate gradient (CG) algorithm and how they perform in finite precision arithmetic. It was shown in [Behavior of slightly perturbed Lanczos and conjugate-gradient recurrences, Lin. Alg. Appl., 113 (1989), pp. 7-63] that under certain conditions the convergence of a slightly perturbed CG computation is like that of exact CG for a matrix with many eigenvalues distributed throughout tiny intervals about the eigenvalues of the given matrix, the size of the intervals being determined by how closely these conditions are satisfied. We determine to what extent each of these variants satisfies the desired conditions, using a set of test problems and show that there is significant correlation between how well these conditions are satisfied and how well the finite precision computation converges before reaching its ultimately attainable accuracy. We show that for problems where the width of the intervals containing the eigenvalues of the associated exact CG matrix makes a significant difference in the behavior of exact CG, the different CG variants behave differently in finite precision arithmetic. For problems where the interval width makes little difference or where the convergence of exact CG is essentially governed by the upper bound based on the square root of the condition number of the matrix, the different CG variants converge similarly in finite precision arithmetic until the ultimate level of accuracy is achieved, although this ultimate level of accuracy may be different for the different variants. This points to the need for testing new CG variants on problems that are especially sensitive to rounding errors.

1 Introduction

Several variants of the conjugate gradient algorithm (CG) for solving a symmetric positive definite linear system $Ax=b$ have been proposed to make better use of parallelism; see, e.g., [27, 28, 19, 29, 4, 5, 11]. Here we consider three variants: the original Hestenes and Stiefel algorithm [17] (HSCG), a variant with somewhat more opportunity for parallelism due to Chronopoulos and Gear [4] (CGCG), and a more recent pipelined version due to Ghysels and Vanroose [11] (GVCG). While all of these algorithms are mathematically equivalent, they behave differently when implemented in finite precision arithmetic. Perhaps the most dramatic difference is in the ultimately attainable accuracy of the computed solution. All of these algorithms compute an initial residual $r_{0}=b-Ax_{0}$ , where $x_{0}$ is the initial guess for the solution, and then compute updated “residual” vectors $r_{k}$ , $k=1,2,\ldots$ , using a recurrence formula. In finite precision arithmetic, however, these updated vectors may differ from the true residuals $b-Ax_{k}$ , where $x_{k}$ is the approximate solution vector generated at step $k$ . When this difference becomes large, the norms of the updated vectors may or may not continue to decrease, but the true residual norm (that is, the norm of $b-Ax_{k}$ ) levels off (or may even grow). The level of accuracy of the approximate solution $x_{k}$ when this occurs is studied in [3, 6].

In this paper, we consider what happens in the stage before the true and updated residual vectors start to deviate significantly. Even during this stage, the different variants may show different convergence patterns on problems with certain eigenvalue distributions that make them especially sensitive to rounding errors. This is a phenomenon that we wish to understand. On other problems, where eigenvalues of the coefficient matrix are distributed in a more uniform way, the algorithms may all behave very similarly. This, too, is something that needs a mathematical explanation since this may hold even after agreement with exact arithmetic is lost. Specifically, we aim to give an explanation for the results shown in Figure 1, where different CG variants converge differently, and in Figure 2, where, although one variant fails to obtain as accurate a solution as the others, up until the point where the convergence curve of that computation levels off, all variants converge similarly.

A good deal of work beginning in the 1980’s (and in the thesis of Paige [23] dating back to 1971) has been aimed at explaining the behavior of the Lanczos and conjugate gradient algorithms in finite precision arithmetic, or, more generally, when the recurrence formulas for these algorithms are perturbed slightly. See, for example, [22, 13, 16, 9]. In exact arithmetic, the Lanczos algorithm can be thought of as a part of the CG algorithm: The Lanczos algorithm generates a sequence of tridiagonal matrices and the CG algorithm implicitly solves linear systems with these tridiagonal matrices in order to approximate the solution of the linear system $Ax=b$ . In a seminal paper [22], Paige showed that what is now the standard implementation of the Lanczos algorithm maintained certain local orthogonality and normalization properties even when implemented in finite precision arithmetic and that those properties could be used to establish results about the eigenvalue/vector approximations generated during a finite precision Lanczos computation. A nice summary of this work can be found in [26], and more recent work by Paige has extended these results significantly [24, 25]. Later, these same properties were used in [13] to establish results about the convergence of the conjugate gradient algorithm under the assumption that these properties were satisfied by a finite precision, or otherwise slightly perturbed, implementation. A natural question to ask is: Which of the various proposed implementations satisfy these properties, and do those that do satisfy the properties used in Paige’s analysis have better behavior than those that do not? For those that do not, are there other ways to explain their behavior? In this paper, we consider three mathematically equivalent variants of the conjugate gradient algorithm: the original Hestenes and Stiefel variant [17] (HSCG), a variant with somewhat more opportunity for parallelism due to Chronopoulos and Gear [4] (CGCG), and a more recent pipelined variant due to Ghysels and Vanroose [11] (GVCG).

In the following subsections, we review the properties that have been assumed in order to relate finite precision or otherwise slightly perturbed CG computations to exact CG for a larger matrix with eigenvalues in small intervals about those of the given matrix. We do not rigorously prove these properties for any particular implementation but give an indication why one of them may be expected to fail in the GVCG variant, and we check numerically whether or not they hold for a number of test problems and whether satisfaction of such properties coincides with faster convergence (in terms of number of iterations). Again, the analysis of this paper deals only with steps at which the true and updated residuals are still in close agreement, that is, before the best level of accuracy is reached. The reader is referred to [3, 6] for a discussion of what this best level of accuracy is for different CG variants.

Throughout the paper, $A$ will denote a real symmetric positive definite matrix, although the results are easily extended to complex Hermitian positive definite matrices. The symbol $\|\cdot\|$ will denote the 2-norm for vectors and the corresponding spectral norm for matrices.

1.1 Slightly Perturbed Lanczos Computations

In [13], an analogy was established between finite precision, or otherwise slightly perturbed, Lanczos computations with matrix $A$ and initial vector $q_{1}$ and exact Lanczos applied to a larger matrix $T$ whose eigenvalues all lie in tiny intervals about the eigenvalues of $A$ . More precisely, it was shown that if $T_{J}$ is the tridiagonal matrix produced at step $J$ of a finite precision computation with matrix $A$ , then $T_{J}$ can be extended to a larger tridiagonal matrix $T$ whose eigenvalues are all close to eigenvalues of $A$ , assuming that the finite precision computation satisfies certain local orthogonality and normalization properties. An algorithm was given for producing such an extension, and this algorithm is outlined in the Appendix of this paper. When exact Lanczos is applied to $T$ with initial vector $\hat{q}_{1}$ equal to the first unit vector, it produces the same tridiagonal matrices $T_{1},\ldots T_{J}$ as the finite precision computation. If $\lambda_{i\ell}$ , $\ell=1,2,\ldots$ , are the eigenvalues of $T$ that are close to eigenvalue $\lambda_{i}$ of $A$ , and if $v_{i\ell}$ , $\ell=1,2,\ldots$ , are the corresponding orthonormal eigenvectors of $T$ , while $v_{i}$ is a normalized eigenvector of $A$ associated with $\lambda_{i}$ , then it was shown that

[TABLE]

This meant that theorems (that assume exact arithmetic) about the behavior of the first $J$ steps of the Lanczos algorithm applied to such matrices $T$ with such initial vectors $\hat{q}_{1}$ could be applied to finite precision computations with matrix $A$ and initial vector $q_{1}$ .

The assumptions needed for this analogy to hold were that vectors $q_{1},\ldots,q_{J}$ generated by the finite precision computation satisfied

[TABLE]

where $Q_{J}$ is the $n$ by $J$ matrix whose columns are $q_{1},\ldots,q_{J}$ , $T_{J}$ is the symmetric tridiagonal matrix

[TABLE]

$\xi_{J}$ is the $J$ th unit vector $(0,\ldots,0,1)^{T}$ , and $F_{J}$ , which accounts for rounding errors, has columns $f_{j}$ , $j=1,\ldots,J$ , satisfying

[TABLE]

where $\epsilon$ is tiny (ideally, a modest multiple of the machine precision). It is further assumed that, because of the choice of the coefficients, $\alpha_{1},\ldots,\alpha_{J},\beta_{1},\ldots,\beta_{J}$ , the 2-norm of each vector $q_{j}$ is approximately $1$ , say, $\|q_{j}\|\in[1-\epsilon,1+\epsilon]$ , and the inner product of successive pairs of vectors is almost [math]:

[TABLE]

The analysis in [13] applies to any computation that satisfies these assumptions for some $\epsilon\ll 1$ . One can extend the tridiagonal matrix $T_{J}$ in the way described in [13] even if $\epsilon$ is not very small, but then there is no guarantee that the eigenvalues of the extended matrix $T$ will all be close to those of $A$ or even that $T$ will be positive definite. This analysis relies heavily on the work of Paige [22, 23], who showed that a good finite precision implementation of the Lanczos algorithm satisfies these assumptions, with explicit bounds on the quantities denoted here as $\epsilon$ .

1.2 Relation Between CG Residuals and Lanczos Vectors

One way to solve a symmetric positive definite linear system $Ax=b$ is to use the Lanczos algorithm to generate the matrices $Q_{J}$ and $T_{J}$ in (1), taking $q_{1}$ to be the normalized initial residual: If $x_{0}$ is an initial guess for the solution of the linear system and $r_{0}=b-Ax_{0}$ , then $q_{1}=r_{0}/\beta$ , where $\beta=\|r_{0}\|$ . Then the approximate solution $x_{J}$ is taken to be

[TABLE]

This choice of $x_{J}$ minimizes the $A$ -norm of the error, $\langle A^{-1}b-x_{J},b-Ax_{J}\rangle^{1/2}$ , among all vectors of the form $x_{0}+Q_{J}y$ , $y\in\mathbb{R}^{J}$ . See, for example, [15, Sec. 2.5]. The conjugate gradient algorithm does this implicitly by generating Cholesky-like factorizations of the successive tridiagonal matrices $T_{k}$ , $k=1,\ldots,J$ : $T_{k}=L_{k}D_{k}L_{k}^{T}$ , where $L_{k}$ is a unit lower bidiagonal matrix and $D_{k}$ is a diagonal matrix. Thus, rounding errors in the conjugate gradient algorithm involve not only those in constructing the columns of $Q_{J}$ but also those in solving the linear system involving $T_{J}$ . Since the two processes are intertwined, the effect of rounding errors can be more difficult to analyze in the conjugate gradient algorithm than in the Lanczos algorithm.

The conjugate gradient algorithm for solving a symmetric positive definite linear system $Ax=b$ can be written in the following form, due to Hestenes and Stiefel [17] (HSCG):

As noted above, it is well-known that if $q_{1}=r_{0}/\|r_{0}\|$ in the Lanczos algorithm, then subsequent Lanczos vectors are just scaled versions of the corresponding CG residuals. To see this from the HSCG algorithm, we first note that the residual vectors $r_{k}$ , $k=0,1,\ldots$ , satisfy a 3-term recurrence:

[TABLE]

If we define normalized residuals by $q_{k+1}:=(-1)^{k}\frac{r_{k}}{\|r_{k}\|}$ , then these vectors satisfy

[TABLE]

or,

[TABLE]

Finally, noting that $b_{k-1}=\|r_{k-1}\|^{2}/\|r_{k-2}\|^{2}$ , this becomes

[TABLE]

Thus, if $Q_{J}$ is the $n$ by $J$ matrix whose columns are $q_{1},\ldots,q_{J}$ , then

[TABLE]

where $\beta_{J}=\|r_{J}\|/(a_{J-1}\|r_{J-1}\|)$ and $T_{J}$ is a symmetric tridiagonal matrix with diagonal entries $\alpha_{j}=1/a_{j-1}+b_{j-1}/a_{j-2}$ , $j=1,2,\ldots,J$ , (where terms involving $a_{-1}$ are taken to be [math]) and sub and super diagonal entries $\beta_{j}=\|r_{j}\|/(a_{j-1}\|r_{j-1}\|)$ , $j=1,\ldots,J-1$ . It follows that if formula (6) can be replaced by something of the form (1) when the columns of $Q_{J}$ come from normalizing “residual” vectors $r_{k}$ in a finite precision CG computation, with the computed vectors satisfying properties (2) and (3) as well, then the analysis of [13] will apply to the finite precision CG computation. We emphasize again that the analysis in [13] gives information about the rate at which the updated residual vectors $r_{k}$ decrease in norm and thus is of interest only as long as these updated vectors resemble the true residuals, $b-Ax_{k}$ .

1.3 Implications for Finite Precision CG Implementations

Under the assumption that formulas (1), (2), (3) hold for some small value $\epsilon$ , when the columns of $Q_{J}$ come from normalizing updated CG residual vectors and the entries of $T_{J}$ are derived from CG coefficients as described above, the analysis in [13] shows that the updated CG residual vectors converge at the rate predicted by exact arithmetic theory for a symmetric positive definite matrix $T$ whose condition number $\kappa$ is just slightly larger than that of $A$ :

[TABLE]

although this may be a substantial overestimate. It shows further that the $A$ -norm of the error in the finite precision computation – that is, the quantity $\langle r_{k},A^{-1}r_{k}\rangle^{1/2}$ – is reduced at about the same rate as the $T$ -norm of the error in exact CG applied to $T$ :

[TABLE]

which again may be an overestimate.

A sharper bound on the quantities on the left in (7) and (8) can be given in terms of the size of the $k$ th degree minimax polynomial on the union of tiny intervals containing the eigenvalues of $T$ ; if these intervals are $[\lambda_{i}-\delta,\lambda_{i}+\delta]$ , then the quantity

[TABLE]

is an upper bound for the quantity on the left in (8) and $\kappa^{1/2}$ times this value is an upper bound for the left-hand side of (7). For some eigenvalue distributions, such as eigenvalues fairly uniformly distributed between $\lambda_{\text{min}}$ and $\lambda_{\text{max}}$ , the size of this minimax polynomial is not much less than that of the Chebyshev polynomial on the entire interval $[\lambda_{\text{min}}-\delta,\lambda_{\text{max}}+\delta]$ , on which the bounds in (7) and (8) are based. However, for other eigenvalue distributions such as eigenvalues tightly clustered at the lower end and highly spread out at the upper end, as will be seen in one of our examples, the difference can be great.

These bounds are independent of the initial residual $r_{0}$ . With knowledge of the size of components of $r_{0}$ in the directions of each eigenvector of $A$ , the analysis in [13] gives additional insight into the convergence of a finite precision CG computation that satisfies the assumptions in [13]. It behaves like exact CG applied to a matrix whose eigenvalues lie in tiny intervals about the eigenvalues of $A$ , with an initial residual $\hat{r}_{0}$ satisfying

[TABLE]

where $v_{i}$ is a normalized eigenvector of $A$ corresponding to eigenvalue $\lambda_{i}$ , $v_{i_{\ell}}$ is a normalized eigenvector of $T$ corresponding to eigenvalue $\lambda_{i_{\ell}}$ , and the sum over $\ell$ is the sum over all eigenvalues of $T$ in the small interval $[\lambda_{i}-\delta,\lambda_{i}+\delta]$ . In some cases, even assuming exact arithmetic where $\delta=0$ , bounds based on the size of the minimax polynomial on the set of eigenvalues are large overestimates for observed convergence rates. While for any given $k$ , there is an initial residual for which equality will hold at step $k$ [12], components of that initial residual may differ by hundreds of orders of magnitude. Such an initial residual could not even be represented on a machine with standard bounds on exponent size, so whatever the initial residual in the finite precision computation, it is necessarily far from the worst possible one.

2 CG Variants Designed for Parallelism

While individual matrix-vector multiplications can be parallelized and vectors can be partitioned among different processors in the HSCG algorithm, almost none of the high-level operations comprising an iteration in that algorithm can be performed simultaneously. Looking at the algorithm of the previous section, it can be seen that at each iteration, the matrix-vector product $Ap_{k-1}$ must be started, with at least part of it completed, before computation of the inner product $\langle p_{k-1},Ap_{k-1}\rangle$ can begin. This inner product must be completed before the vectors $x_{k}$ and $r_{k}$ can be formed, and $r_{k}$ must be at least partly completed before computation of the next inner product $\langle r_{k},r_{k}\rangle$ can begin. This inner product must be completed before $p_{k}$ can be formed, and at least part of $p_{k}$ must be completed before the start of the next iteration computing $Ap_{k}$ . It has been observed that waiting for the two inner products to complete can be very costly when using large numbers of processors [1, 6].

Several mathematically equivalent CG variants have been devised to allow overlapping of inner products with each other and with the matrix-vector multiplication in each iteration of the algorithm. In the following sections, we consider two of these: one due to Chronopoulos and Gear [4, 5] (CGCG) that allows either overlapping of the two inner products or overlapping of one of these with the matrix-vector product, and a pipelined version due to Ghysels and Vanroose [11] (GVCG) that allows overlapping of both inner products as well as the matrix-vector multiplication. We give an indication of why the value of $\epsilon$ in (2) might be expected to be larger for the GVCG variant than for HSCG and CGCG, and we demonstrate that it is, indeed, larger for a set of test problems. This does not necessarily mean slower convergence, however; it means simply that in the analogy with exact CG for a matrix whose eigenvalues are in intervals about the eigenvalues of $A$ , the interval size must be larger for GVCG. If this interval size makes a significant difference in the convergence rate of exact CG, then we expect the finite precision GVCG computation to require more iterations than the other variants. Again, we emphasize that our analysis applies only before agreement between true and updated residual vectors is lost and before the ultimate level of accuracy is achieved. All of our experiments are performed on a single processor using standard double precision arithmetic, and we do not consider the timing of the algorithms, only the number of iterations required to reach a given level of accuracy (assuming that that level of accuracy is reached by the algorithm).

Since it is known that a good implementation of the Lanczos algorithm satisfies (1 – 3) [22, 23], we also compared the CG variants with results obtained by using the Lanczos algorithm to generate $Q_{J}$ and $T_{J}$ in (6) and then using extra precision to compute $x_{J}$ in (4). These results were very similar to those obtained with HSCG, so they are not included in the plots.

2.1 HSCG

When the Hestenes and Stiefel algorithm of the previous section is implemented in finite precision arithmetic, the vectors $r_{k}$ and $p_{k}$ satisfy

[TABLE]

The roundoff terms can be bounded, as in [30], using standard results on floating point arithmetic; see, e.g., [18]. For a scalar $\alpha$ , $n$ -vectors $v$ and $w$ , and an $n$ by $n$ matrix $A$ , we have

[TABLE]

where $\epsilon$ is the machine precision and $\mbox{fl}[\cdot]$ denotes the result of floating point evaluation. If $A$ has at most $m$ nonzeros per row and the matrix-vector product is computed in the standard way, then $c$ can be taken to be $mn^{1/2}$ ; alternatively, $c\|A\|$ in (11) can be replaced by $m\|~{}|A|~{}\|$ , where $|A|$ is the $n$ by $n$ matrix whose entries are the absolute values of those of $A$ . Applying these rules to the formulas in the HSCG algorithm, we can write

[TABLE]

Similarly, the errors in the computed coefficients $a_{k-1}$ and $b_{k}$ can be bounded as in [30]. Here we will assume that the coefficients satisfy the formulas in the HSCG algorithm, namely,

[TABLE]

and we will include any errors in computing these formulas in the $\delta_{r_{k}}$ and $\delta_{p_{k}}$ terms. It follows, as in [13], that

[TABLE]

where

[TABLE]

Defining $q_{k+1}:=(-1)^{k}r_{k}/\|r_{k}\|$ , we can write

[TABLE]

or,

[TABLE]

With the formula for $b_{k-1}$ , this takes a form similar to (5):

[TABLE]

From (12), the roundoff term in (13) can be written as

[TABLE]

This is the $k$ th column of $F_{J}$ in (1). Note that it involves only local rounding errors and might therefore be expected to be of order $\epsilon\|A\|$ , where $\epsilon$ is a moderate multiple of the machine precision. We will see in later examples that this is indeed the case.

2.2 CGCG

Chronopoulos and Gear proposed the following version of the CG algorithm to make better use of parallelism [4]:

Notice that the computation of $\nu_{k}=\langle r_{k},r_{k}\rangle$ can be overlapped with that of $w_{k}=Ar_{k}$ . Alternatively, once $w_{k}=Ar_{k}$ has been formed, the two inner products $\nu_{k}=\langle r_{k},r_{k}\rangle$ and $\eta_{k}=\langle r_{k},w_{k}\rangle$ can be computed simultaneously. In exact arithmetic, the additional vector $s_{k}$ is equal to $Ap_{k}$ .

The CGCG algorithm can be written in the form (5) in much the same way as the HSCG algorithm. For the finite precision computation, the relevant formulas are

[TABLE]

where the roundoff terms $\delta_{r_{k}}$ , $\delta_{p_{k}}$ , and $\delta_{s_{k}}$ can be bounded using (11), very similarly to those in HSCG:

[TABLE]

Again, we will assume that the coefficients $a_{k-1}$ and $b_{k}$ satisfy the formulas in the CGCG algorithm, with any errors in evaluating these formulas included in the $\delta_{r_{k}}$ , $\delta_{p_{k}}$ , and $\delta_{s_{k}}$ terms.

Eliminating the $s_{k}$ ’s and $p_{k}$ ’s, we can obtain a three-term recurrence for $r_{k}$ :

[TABLE]

where

[TABLE]

Defining $q_{k+1}:=(-1)^{k}r_{k}/\|r_{k}\|$ , and proceeding exactly as was done for HSCG, we obtain equation (13), where now the roundoff term is

[TABLE]

Again, this involves only local rounding errors. Comparing (14) and (15), we see that they look very similar except that $A\delta_{p_{k-1}}$ in (14) is replaced by $\delta_{s_{k-1}}$ in (15), where, in exact arithmetic, $s_{k-1}=Ap_{k-1}$ .

2.3 GVCG

This algorithm, developed by Ghysels and Vanroose [11] and also known as pipelined CG, offers the most opportunity for overlapping parallel computations:

Note that the computation of both inner products $\langle r_{k},r_{k}\rangle$ and $\langle w_{k},r_{k}\rangle$ required at each iteration can be overlapped with each other and with the matrix vector product, $t_{k}=Aw_{k}$ , as well as with some of the vector operations. In exact arithmetic, the auxiliary vectors satisfy $s_{k}=Ap_{k}$ , $w_{k}=Ar_{k}$ , $u_{k}=As_{k}=A^{2}p_{k}$ , $t_{k}=Aw_{k}=A^{2}r_{k}$ .

In finite precision arithmetic, the vectors in the GVCG algorithm satisfy

[TABLE]

where, again using (11), we see that the roundoff terms satisfy

[TABLE]

When we try to eliminate auxiliary vectors and form a three-term recurrence for $r_{k}$ , we find that

[TABLE]

It was noted that $w_{k-1}=Ar_{k-1}$ in exact arithmetic, so we can write this recurrence in the form

[TABLE]

The amount by which the computed vector $r_{k}$ fails to satisfy a three-term recurrence now depends not only on local rounding errors, but also on the amount by which $w_{k-1}$ differs from $Ar_{k-1}$ . This will involve rounding errors made at all previous steps. To see the size of this difference, subtract $A$ times the equation for $r_{k-1}$ from the equation for $w_{k-1}$ :

[TABLE]

and apply this recursively to obtain

[TABLE]

To determine the size of the difference between $u_{j}$ and $As_{j}$ , subtract $A$ times the equation for $s_{j}$ from that for $u_{j}$ and apply recursively to find

[TABLE]

Finally, noting that $b_{\ell}=\|r_{\ell}\|^{2}/\|r_{\ell-1}\|^{2}$ , one can replace the above products to obtain

[TABLE]

Substituting this expression into (17) and the result into (16), we see the amount by which $r_{k}$ may fail to satisfy the three-term recurrence that it satisfied in the other algorithms to within local roundoff errors:

[TABLE]

This suggests that the matrix $F_{J}$ in (1) may be significantly larger for this algorithm than for the others, since it involves roundoff terms from all steps of the computation, and roundoff terms that are small compared to, say, $\|r_{0}\|$ might not be so small compared to $\|r_{k}\|$ .

Note that variants of the rounding error expressions derived in this subsection have been presented in [3, 6]. Note also that a shifted version of GVCG [7] has recently been proposed to avoid some of the accuracy problems with the original version. In fact, on all of the test problems in Figure 2 of this paper, this version (using an appropriate shift based on estimates of the largest and smallest eigenvalues of the matrix) achieves the same ultimate level of accuracy as HSCG and CGCG. However, it is still the case that the entries of the matrix $F_{J}$ in (1) are larger than for HSCG and CGCG. Using this version, we have not been able to improve significantly on the convergence rate before the ultimate level of accuracy is achieved for the problems shown in Figures 1 and 6. Other pipelined CG variants have also been proposed in [8]. The analysis of these procedures is beyond the scope of this paper, but again the goal has been improvement in the ultimate level of accuracy. In fact, the authors of [8] state that “… it is well known that Krylov subspace methods may also suffer from delay of convergence due to loss of basis orthogonality…. Analyzing the stability issues related to loss of orthogonality deserves to be treated as part of future work.”

3 Some Test Problems

The following problem –bcsstk03 from the BCSSTRUC1 (BCS Structural Engineering Matrices) in the Harwell-Boeing collection [10] – was studied in [3]. It is a $112\times 112$ matrix with condition number $6.8e+6$ . For convenience, we normalized the matrix so that the matrix we used had spectral norm 1. In exact arithmetic, the CG algorithm would obtain the exact solution in at most $112$ steps. Results of running HSCG, CGCG, and GVCG are plotted in Figure 1. We set a random solution vector $x$ and computed $b=Ax$ , and we used a zero initial guess $x_{0}$ . Computations were carried out in MATLAB, using standard double precision arithmetic. To be sure that we had the true solution $A^{-1}b$ to double precision accuracy, we solved the linear system $Ax=b$ directly using higher precision and replaced the original vector $x$ by the one obtained by rounding the multiprecision solution to double precision. The figure shows the $A$ -norm of the error at each step $k$ , $\langle A^{-1}b-x_{k},b-Ax_{k}\rangle^{1/2}$ , divided by the $A$ -norm of the initial error. Also shown is the upper bound (8), which is a large overestimate for all variants. Note that the different variants of CG not only reach different levels of accuracy, but even before the ultimate accuracy level is reached, they converge at different rates. The fastest (in terms of number of iterations) is HSCG, followed by CGCG, with GVCG requiring the most iterations.

The situation is different, however, for other matrices in this collection. Figure 2 shows the convergence of HSCG, CGCG, and GVCG for six other test matrices. Here we used the diagonal of the matrix as a preconditioner but, to avoid possibly different rounding errors in preconditioned variants, each matrix was prescaled by its diagonal (that is, $A$ was replaced by $D^{-1/2}AD^{-1/2}$ , where $D=\mbox{diag}(A)$ ) and the value $\kappa$ printed on each plot is the condition number of the prescaled matrix. Again, we set a random solution vector (and computed the right-hand side as the product of the prescaled matrix times the random solution vector) and a zero initial guess. While there is still some difference in the attainable level of accuracy for the different variants, until this level is reached, all methods converge at essentially the same rate. The bound (8) is shown as well, and while this provides a good estimate of the actual convergence rate for some of the problems, it is a large overestimate for others. Note that here we are plotting the true $A$ -norm of the error $\langle A^{-1}b-x_{k},b-Ax_{k}\rangle^{1/2}$ and once $b-Ax_{k}$ starts to differ substantially from the updated vector $r_{k}$ , one would not expect this quantity to continue to decrease; we see that with GVCG it may even grow.

For several of these problems, we computed

[TABLE]

where $Q_{k}$ and $T_{k}$ were computed from the CG residuals and coefficients, to determine if the tridiagonal linear systems were being solved accurately. In all cases, this quantity was tiny, indicating that all variants are accurately solving the tridiagonal system. The difference must be in the tridiagonal matrices that they are producing. For all of the problems, the tridiagonal matrix produced at the last step shown in Figure 2 for the GVCG computation was indefinite, while the condition numbers of the final tridiagonal matrices produced in the HSCG and CGCG computations were essentially equal to that of $A$ . Note, however, that these computations were run well past the point where the true and updated residual vectors started to differ significantly, and the tridiagonal matrices produced while the true and updated residual vectors were still in close agreement in the GVCG algorithm were positive definite and had condition numbers approaching that of $A$ , similar to the other methods.

We also computed the quantities

[TABLE]

where $f_{k}$ is the $k$ th column of $F_{J}$ in (1), when $T_{J}$ , $Q_{J}$ , $\beta_{J}$ , and $q_{J+1}$ come from the CG residuals and coefficients, to see if conditions (2) and (3) hold. In all cases $\epsilon_{2}$ was tiny, that is, a modest multiple of the machine precision. The same was true for $\epsilon_{1}$ in the HSCG and CGCG computations, but not in GVCG. This might be expected based on arguments in the previous section. We observed, however, that for most steps before $r_{k}$ and $b-Ax_{k}$ started to differ greatly and before GVCG’s ultimate level of accuracy was reached, the value of $\epsilon_{1}$ in GVCG, while larger than that in HSCG and CGCG, was still less than about $1.0e-7$ . We reasoned that for these steps, while the HSCG and CGCG computations behaved like exact CG for a problem with eigenvalues throughout tiny intervals about the eigenvalues of $A$ , the GVCG computation might behave like exact CG for a problem with eigenvalues throughout small, but not as small, intervals about the eigenvalues of $A$ . If the interval size made a significant difference in the convergence of exact CG, then one would expect slower convergence from GVCG (as seen with bcsstk03), while if the interval size made little difference in the convergence of exact CG, then one would expect all three variants to converge at about the same rate (as seen with the other bcsstk problems).

To test this hypothesis, we computed the eigenvalues of several of the matrices: bcsstk03, bcsstk14, bcsstk15, bcsstk16, and bcsstk27. For each, we formed a larger (diagonal) matrix $\hat{A}$ with $11$ eigenvalues distributed about each eigenvalue of the given matrix, in intervals of width $1.0e-14$ or $1.0e-7$ . Our aim was to determine if the convergence of exact CG was significantly affected by this interval size, so we used multiple precision arithmetic and full reorthogonalization of the CG residuals to emulate exact arithmetic. We used the same random solution vector for both interval sizes, and a zero initial guess was used. The results are shown in Figure 3. We ran the computations only to the step where $\epsilon_{1}$ in GVCG started to exceed $1.0e-7$ . As noted above, after this point, the true and updated GVCG residual vectors started to diverge and progress soon ceased. The interval sizes $1.0e-14$ and $1.0e-7$ roughly coincided with the size of $\epsilon_{1}$ for HSCG/CGCG and for GVCG, respectively.

Note that the interval width makes a large difference in the convergence of exact CG for the matrix $\hat{A}$ associated with the bcsstk03 matrix, and the different CG variants behaved very differently in finite precision arithmetic. For the other problems, the interval width makes little difference in the convergence of exact CG for the matrix $\hat{A}$ , and all CG variants behaved similarly in finite precision arithmetic (at least, over the steps shown in Figure 3). Again, it is our goal to explain the behavior of the different CG variants at these steps and not at later steps when agreement between true and updated residuals has been lost.

To further understand the behavior of the different algorithms for the bcsstk03 problem, we used the procedure described in [13] and outlined in the Appendix of this paper to construct a matrix for which exact CG would behave like each of the different finite precision variants for the first 500 steps, which is somewhat past the point where the convergence curves in Figure 1 start to follow different trajectories. [See the Appendix for a more precise description of the relation between the finite precision computations and exact CG for this larger matrix.] There are many different matrices $T$ for which exact CG applied to $T$ would match the behavior of each of these finite precision computations [20], and the extension procedure given in [13] does not necessarily produce the matrix $T$ with eigenvalues in the smallest possible intervals about the eigenvalues of $A$ . Still, we found that for HSCG, it produced a matrix with eigenvalues in intervals of width $2.2e-9$ about the eigenvalues of $A$ , for CGCG the interval width was $1.7e-8$ , and for GVCG it was $3.0e-7$ . The theory is not precise enough to predict the difference in interval width for HSCG and CGCG based on the size of $\epsilon_{1}$ , $\epsilon_{2}$ , and $\epsilon_{3}$ , but it does appear that this interval width must be larger for CGCG, accounting for the somewhat slower rate of convergence. For illustration, the eigenvalues of the tridiagonal matrix produced at step 500 by the HSCG computation and the eigenvalues of the matrix $T$ for which exact CG matches the behavior of HSCG for the first 500 steps are plotted in Figure 4. The figure uses histograms with bins of width $1.0e-8$ about each eigenvalue of the bcsstk03 matrix and bins to represent values in between. Thus, if $\lambda_{1}<\lambda_{2}<\ldots$ are the distinct eigenvalues of the bcsstk03 matrix, then any eigenvalues less than $\lambda_{1}-1.0e-8$ are counted in bin 1, any between $\lambda_{1}-1.0e-8$ and $\lambda_{1}+1.0e-8$ are counted in bin 2, those between $\lambda_{1}+1.0e-8$ and $\lambda_{2}-1.0e-8$ are counted in bin 3, etc. Note that all eigenvalues of the matrix $T$ are in even numbered bins, meaning that they are within $1.0e-8$ of an eigenvalue of the bcsstk03 matrix, but as noted above, they were actually within $2.2e-9$ of these eigenvalues.

Finally, to understand what sorts of eigenvalue distributions lead to different convergence curves for the different finite precision implementations, we plot the eigenvalues of these matrices on a log scale in Figure 5. Note that the bcsstk03 matrix has four large eigenvalues (marked with $\circ$ ’s in the figure) that are well-separated from the others. It is known that exact CG applied to a matrix with eigenvalues throughout intervals around large well-separated eigenvalues will converge significantly slower than exact CG for a matrix that has just the few discrete outliers, since the CG polynomial associated with the intervals will have multiple roots within each interval. Moreover, the size of the intervals will determine the frequency with which roots are put down in the intervals. See [13]. The other bcsstk matrices have more small well-separated eigenvalues, which present less of a problem for finite precision computations [15, Ch. 4]. Moreover, ignoring the very small eigenvalues, the middle eigenvalues of the other bcsstk matrices range over fewer orders of magnitude than the bcsstk03 eigenvalues.

For further verification, we include a test problem from [16] whose eigenvalues are highly spread out at the upper end. Taking $n=48$ and $\rho=0.8$ , we formed a matrix with the following eigenvalues:

[TABLE]

These eigenvalues are also shown in Figure 5, in the bottom right subplot. We chose random orthonormal eigenvectors, a random solution vector, and a zero initial guess for the solution. Results of running the HSCG, CGCG, and GVCG algorithms are plotted in Figure 6. Also plotted is the upper bound (8) using $\kappa=1000$ and the quantity (9) using $\delta=1.0e-14$ and using $\delta=1.0e-7$ . The minimax polynomial on the union of intervals was computed using the Remez algorithm [2, pp. 280-289]. Note that for this problem, the interval size makes a significant difference in the size of the minimax polynomial on the union of intervals.

Although the exact solution would be obtained after $48$ steps using exact arithmetic, all of the finite precision implementations required about $100$ iterations to achieve their best level of accuracy. Note, however, that while the convergence curves for HSCG and CGCG are very similar before this point, that of GVCG is significantly worse. Again we looked at the quantity in (18) and at $\epsilon_{1}$ and $\epsilon_{2}$ in (19) and found that all were moderate multiples of the machine precision except $\epsilon_{1}$ in GVCG, which was $7.5e-5$ . Note that the finite precision GVCG computation cannot be equivalent to exact CG for a matrix with eigenvalues in intervals of width $1.0e-14$ about the eigenvalues of $A$ since the GVCG convergence curve goes above the upper bound (9), which holds for exact CG for all such matrices. The GVCG convergence curve does, however, remain below the upper bound for exact CG applied to matrices with eigenvalues in intervals of width $1.0e-7$ about the eigenvalues of $A$ .

4 Conclusions

In order to relate the behavior of a finite precision CG computation to that of exact CG for a matrix with eigenvalues in small intervals about the eigenvalues of $A$ using the analysis in [13], the finite precision computation must satisfy (2) and (3) for some small number $\epsilon\ll 1$ . Even if $\epsilon$ is not so small, the finite precision computation may satisfy the bounds (7) and (8) provided only that the tridiagonal matrix that it produces has its eigenvalues essentially between the largest and smallest eigenvalues of $A$ and provided formula (4) is satisfied to a close approximation. These are the conditions that one should check in designing new implementations.

It was observed numerically that all three of the implementations considered here – HSCG, CGCG, and GVCG – satisfied (3) and (4). The only difference was that GVCG did not satisfy (2) as well as the others. Yet even GVCG satisfied (2) reasonably well up to a point, and after that point progress soon ceased. What happened before that point? For most problems, failure to satisfy (2) to the order of machine precision made little difference – the GVCG computation behaved like exact CG for a matrix with eigenvalues in small but not tiny intervals about the eigenvalues of $A$ , and the behavior of exact CG was not very sensitive to the precise size of the intervals. For the bcsstk03 matrix, however, like the model problem with eigenvalues highly spread out at the upper end, that is not the case. The behavior of exact CG for a matrix with eigenvalues throughout small intervals about the eigenvalues of those matrices is very sensitive to the size of the intervals, and the different CG variants behaved differently in finite precision arithmetic even before the final level of accuracy was reached. Thus, in evaluating a new implementation of the CG algorithm, one must be careful to include test problems that stress the effect of rounding errors on the convergence rate. It may be that the cost of extra iterations is outweighed by the advantages of parallelism or other things, but this should be included in the overall evaluation of the method.

Acknowledgment: The authors thank the referees for their careful reading of this manuscript and their many helpful suggestions.

Appendix: Software

The MATLAB codes used to produce plots in this paper can be found at https://github.com/HexuanLiu/Conjugate_gradient.

The most interesting of these is extendT.m, which takes as input a symmetric positive definite matrix $A$ , a symmetric tridiagonal matrix $T_{J}$ and a set of unit vectors $q_{1},\ldots,q_{J}$ stored as columns of a matrix $Q_{J}$ (such as those returned by HSCG.m, CGCG.m, or GVCG.m), and the number of digits ndigits to use with MATLAB’s variable precision arithmetic toolbox. It returns a multiprecision symmetric tridiagonal matrix T_vpa that is an extension of $T_{J}$ whose eigenvalues are, hopefully, all close to eigenvalues of $A$ . It also returns a multiprecision array Q_vpa whose columns each have norm 1 and such that A * Q_vpa is approximately equal to Q_vpa * T_vpa. It uses the procedure outlined in [13] to construct T_vpa and Q_vpa. This procedure is described below.

Assume that the input variables satisfy

[TABLE]

Let $T_{J}S_{J}=S_{J}\Theta_{J}$ be an eigendecomposition of $T_{J}$ , and define $Y_{J}:=Q_{J}S_{J}$ . Multiplying (21) on the right by $S_{J}$ , we have

[TABLE]

Let $y_{1},\ldots y_{J}$ denote the columns of $Y_{J}$ (referred to as Ritz vectors) and let $\theta_{1},\ldots,\theta_{J}$ denote the diagonal entries of $\Theta_{J}$ (Ritz values). Define a Ritz value $\theta_{i}$ to be well-separated from the others if

[TABLE]

where, initially, cluster_width is taken to be the square root of the machine precision; otherwise, consider it to be part of a cluster. For clustered Ritz values, define a cluster vector by

[TABLE]

where $S_{J,\ell}$ is the $(J,\ell)$ -entry of $S_{J}$ and the sum is over all Ritz vectors corresponding to Ritz values in the cluster. Define a cluster value by

[TABLE]

We will say that a Ritz vector $y_{i}$ corresponding to a well-separated Ritz value is converged if $\beta_{J}|S_{J,i}|\leq\verb+(conv_tol)+\|A\|$ , where, initially, conv_tol is taken to be the square root of the machine precision; otherwise, it is unconverged. We will say that a cluster vector $y^{C}$ is converged if $\beta_{J}w_{C}\leq\verb+(conv_tol)+\|A\|$ and unconverged otherwise. Let $\hat{Y}_{m}$ have $m$ columns consisting of the unconverged Ritz vectors and the unconverged cluster vectors.

Assuming that $\epsilon_{1}$ and $\epsilon_{2}$ in (19) are on the order of machine precision, it is argued in [13] that the columns of $\hat{Y}_{m}$ should be almost orthonormal, $q_{J+1}$ should be almost orthogonal to the columns of $\hat{Y}_{m}$ , and $q_{J}$ should be almost equal to a linear combination of these columns. Since not all CG variants maintain $\epsilon_{1}$ and $\epsilon_{2}$ at the level of machine precision, information is printed out to show how closely these properties are satisfied, and the user is given an opportunity to adjust cluster_width and conv_tol to better satisfy these properties. For HSCG applied to the bcsstk03 problem, we achieved the best results by taking cluster_width and conv_tol to be $1.0e-9$ , while we left them at the square root of the machine precision for CGCG, and we set them to $1.0e-6$ for GVCG.

Once the columns of $\hat{Y}_{m}$ are determined, the rest of the code is straightforward. The next Lanczos vector $q_{J+1}$ is modified (slightly) to be exactly (that is, to ndigits precision) orthogonal to the columns of $\hat{Y}_{m}$ . Successive vectors satisfy the usual 3-term Lanczos recurrence, with the coefficients being used to extend $T_{J}$ , but the recurrence is perturbed to make the new vectors exactly orthogonal to each other and to the columns of $\hat{Y}_{m}$ . This means that the algorithm will terminate with $q_{J+n-m}$ equal to [math] and with a matrix T_vpa of size $J+n-m$ whose eigenvalues should all be close to eigenvalues of $A$ (assuming, again, that $\epsilon_{1}$ and $\epsilon_{2}$ are sufficiently small).

The driver code, bcsstk03magic.m, runs either HSCG, CGCG, or GVCG and then calls extendT to extend the tridiagonal matrix to one whose eigenvalues are close to eigenvalues of $A$ . It then runs cg_vpa, a variable precision CG code using full reorthogonalization, with the matrix T_vpa and right-hand side equal to the first unit vector. It plots the 2-norm of the residual and the T_vpa-norm of the error at each step of the multiprecision CG computation to demonstrate that these values fall right on top of those from the finite precision computation with $A$ . More precisely, the 2-norms of the residuals in exact CG for T_vpa exactly match the 2-norms of the vectors $r_{k}$ in the finite precision computation, assuming that the tridiagonal matrix from the finite precision computation (of which T_vpa is an extension) consists of entries satisfying the exact coefficient formulas in the algorithms; the match between the T_vpa-norm of the error in exact CG applied to T_vpa and the quantities $\langle r_{k},A^{-1}r_{k}\rangle$ in the finite precision computations is very close but not exact.

Bibliography30

The reference list from the paper itself. Each links out to its DOI / PubMed record.

1[1] T. J. Ashby, P. Ghysels, W. Heirman, and W. Vanroose, The impact of global communication latency at extreme scales on Krylov methods , in Algorithms and Architectures for Parallel Processing , eds. Y. Xiang, I. Stojmenovic, B. O. Apduhan, G. Wang, K. Nakano, and A. Zomaya, Springer Berlin, Heidelberg, 2012, pp. 428-442.
2[2] E. K. Blum, Numerical Analysis and Computation: Theory and Practice , Addison-Wesley, Philippines, 1972.
3[3] E. C. Carson, M. Rozložník, Z. Strakoš, P. Tichý, and M. Tůma, The numerical stability analysis of pipelined conjugate gradient methods: Historical context and methodology , SIAM J. Sci. Comput. 40 (2018), pp. A 3549-A 3580. Czech Republic, 2016.
4[4] A. T. Chronopoulos and C. W. Gear, s 𝑠 s -step iterative methods for symmetric linear systems , J. Comput. Appl. Math., 25 (1989), pp. 153-168.
5[5] A. T. Chronopoulos and C. W. Gear, On the efficient implementation of preconditioned s 𝑠 s -step conjugate gradient methods on multiprocessors with memory hierarchy , Parallel Comput. 11 (1989), pp. 37-53.
6[6] S. Cools, E. F. Yetkin, E. Agullo, L. Giraud, and W. Vanroose, Analyzing the effect of local rounding error propagation on the maximal attainable accuracy of the pipelined conjugate gradients method , SIAM J. Matrix Anal. Appl. 39 (2017), pp. 426-450.
7[7] S. Cools, and W. Vanroose, Numerically stable variants of the communication-hiding pipelined conjugate gradients algorithm for the parallel solution of large scale symmetric linear systems , ar Xiv:1706.05988 v 2, 2018.
8[8] S. Cools, J. Cornelis, and W. Vanroose, Numerically stable recurrence relations for the communication hiding pipelined conjugate gradient method , IEEE Transactions on Parallel and Distributed Systems 30 (2019), pp. 2507-2522.