Numerically Stable Recurrence Relations for the Communication Hiding Pipelined Conjugate Gradient Method
Siegfried Cools, Jeffrey Cornelis, Wim Vanroose

TL;DR
This paper introduces a numerically stable variant of the pipelined Conjugate Gradient method, enhancing accuracy and parallel performance for large-scale linear system solutions without increasing computational cost.
Contribution
It presents a new two-term recurrence relation that improves numerical stability of the pipelined Conjugate Gradient method, enabling high accuracy regardless of pipeline length.
Findings
Achieves high accuracy independently of pipeline length
Demonstrates excellent parallel performance
Resolves stability issues in pipelined Krylov methods
Abstract
Pipelined Krylov subspace methods (also referred to as communication-hiding methods) have been proposed in the literature as a scalable alternative to classic Krylov subspace algorithms for iteratively computing the solution to a large linear system in parallel. For symmetric and positive definite system matrices the pipelined Conjugate Gradient method outperforms its classic Conjugate Gradient counterpart on large scale distributed memory hardware by overlapping global communication with essential computations like the matrix-vector product, thus hiding global communication. A well-known drawback of the pipelining technique is the (possibly significant) loss of numerical stability. In this work a numerically stable variant of the pipelined Conjugate Gradient algorithm is presented that avoids the propagation of local rounding errors in the finite precision recurrence relations that…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsMatrix Theory and Algorithms · Electromagnetic Scattering and Analysis · Advanced Numerical Methods in Computational Mathematics
Numerically Stable Recurrence Relations for the Communication Hiding Pipelined Conjugate Gradient Method
Siegfried Cools∗, Jeffrey Cornelis∗, Wim Vanroose∗
Applied Mathematics Group, Department of Mathematics and Computer Science, University of Antwerp. Address: University of Antwerp, Campus Middelheim, Building G, Middelheimlaan 1, 2020 Antwerp, Belgium. E-mail: [email protected] (corresponding author). Funding: S. Cools is funded by Research Foundation Flanders (FWO) under grant 12H4617N. J. Cornelis receives funding from the University of Antwerp Research Council under the University Research Fund (BOF).Manuscript received ; revised (TBA).
Abstract
Pipelined Krylov subspace methods (also referred to as communication-hiding methods) have been proposed in the literature as a scalable alternative to classic Krylov subspace algorithms for iteratively computing the solution to a large linear system in parallel. For symmetric and positive definite system matrices the pipelined Conjugate Gradient method, p()-CG, outperforms its classic Conjugate Gradient counterpart on large scale distributed memory hardware by overlapping global communication with essential computations like the matrix-vector product, thus “hiding” global communication. A well-known drawback of the pipelining technique is the (possibly significant) loss of numerical stability. In this work a numerically stable variant of the pipelined Conjugate Gradient algorithm is presented that avoids the propagation of local rounding errors in the finite precision recurrence relations that construct the Krylov subspace basis. The multi-term recurrence relation for the basis vector is replaced by three-term recurrences, improving stability without increasing the overall computational cost of the algorithm. The proposed modification ensures that the pipelined Conjugate Gradient method is able to attain a highly accurate solution independently of the pipeline length. Numerical experiments demonstrate a combination of excellent parallel performance and improved maximal attainable accuracy for the new pipelined Conjugate Gradient algorithm. This work thus resolves one of the major practical restrictions for the useability of pipelined Krylov subspace methods.
Index Terms:
Krylov subspace methods, Pipelining, Parallel performance, Global communication, Latency hiding, Conjugate Gradients, Numerical stability, Inexact computations, Attainable accuracy.
AMS subject classifications: 65F10, 65N12, 65G50, 65Y05, 65N22.
1 Introduction
The family of iterative solvers known as Krylov subspace methods (KSMs) [29, 35, 43] are among the most efficient present-day methods for solving large scale sparse systems of linear equations. The mother of all Krylov subspace methods is undoubtedly the Conjugate Gradient method (CG) that was derived in 1952 [25] to the aim of solving linear systems with a symmetric positive definite and preferably sparse matrix . The CG method is one of the most widely used methods for solving said systems today, which form the basis of a plethora of scientific and industrial applications. However, driven by the essential transition from optimal single node performance towards massively parallel computer hardware over the last decades [37], the bottleneck for fast execution of Krylov subspace methods has shifted. Whereas in the past the application of the sparse matrix-vector product (spmv) was considered the most time-consuming part of the algorithm, the global synchronizations required in dot product and norm computations form the main bottleneck for efficient execution on present-day distributed memory hardware [15].
Driven by the increasing levels of parallelism in present-day HPC machines, as attested by the current strive towards exascale high-performance computing software [14], research on the elimination of the global communication bottleneck has recently regained significant attention from the international computer science, engineering and numerical mathematics communities. Sprouting largely from pioneering work on reducing communication in Krylov subspace methods from the late 1980’s and 90’s [40, 31, 6, 12, 13, 17], a number of variants of the classic Krylov subspace algorithms have been introduced over the last years. We point out recent work by Chronopoulos et al. [7], Hoemmen [27], Carson et al. [4], McInnes et al. [30], Grigori et al. [23], Eller et al. [16], Imberti et al. [28] and Zhuang et al. [47].
The contents of the current work are situated in the research branch on so-called “pipelined” Krylov subspace methods111Note: In the context of communication reduction in Krylov subspace methods, the terminology “pipelined” KSMs that is used throughout the related applied linear algebra and computer science literature refers to software pipelining, i.e. algorithmic reformulations to the KSM procedure in order to reduce communication overhead, and should not be confused with hardware-level instruction pipelining (ILP). [18, 19, 9]. Alternatively called “communication hiding” methods, these algorithmic variations to classic Krylov subspace methods are designed to overlap the time-consuming global communications in each iteration of the algorithm with computational tasks such as calculating spmvs or axpys (vector operations of the form ). Thanks to the reduction/elimination of the synchronization bottleneck, pipelined algorithms have been shown to increase parallel efficiency by allowing the algorithm to continue scaling on large numbers of processors [36, 46]. However, the algorithmic reformulations that allows for this efficiency increase come at the cost of reduced numerical stability [19, 5], which presently is one of the main drawbacks of pipelined (as well as other communication reducing) methods. Research on analyzing and improving the numerical stability of pipelined Krylov subspace methods, which is essential both for a proper understanding and the practical usability of the methods, has recently been performed by the authors [10, 8] and others [3, 5].
This work presents a numerically stable variant of the -length pipelined Conjugate Gradient method, p()-CG for short. The p()-CG method was presented in [11] and allows to overlap each global reduction phase with the computational work of subsequent iterations. The pipeline length is a parameter of the method that can be chosen depending on the problem and hardware setup (as a function of the communication-to-computation ratio). As is the case for all communication reducing Krylov subspace methods, the preconditioner choice influences the communication-to-computation ratio and thus affects the performance of the method. The pipeline length hence also depends on the effort invested in the preconditioner. A preconditioner that uses limited global communication (block Jacobi, no-overlap DDM, …) is generally preferred in this setting.
The propagation of local rounding errors in the multi-term recurrence relations of the p()-CG algorithm is the primary source of loss of attainable accuracy on the final solution [11]. By introducing intermediate auxiliary basis variables, we derive a new p()-CG algorithm with modified recurrence relations for which no rounding error propagation occurs. It is proven analytically that the resulting recurrence relations are numerically stable for any pipeline length . The new algorithm is guaranteed to reach the same accuracy as the classic CG method. This work thus resolves one of the major restrictions for the practical use of pipelined Krylov subspace methods. The redesigned algorithm comes at no additional computational cost and has only a minor storage overhead compared to the former p()-CG algorithm, thus effectively replacing the earlier implementation of the method. In addition, it is shown that formulating a preconditioned version of the new algorithm is straightforward.
We conclude this introduction by providing a short overview of the further contents of this manuscript. Section 2 presents a high-level summary of the basic principles behind the -length pipelined CG method and formulates the key numerical properties of the method that motivate this work. It familiarizes the reader with the notations and concepts used throughout this paper. In Section 3 the numerical stability analysis of the p()-CG recurrence relations is briefly recapped, as it forms the basis for the analysis of the stable algorithm in Section 4.4. Section 4 contains the main contributions of this work, presenting the technical details of the stable p()-CG algorithm alongside an overview of its main implementation properties and a numerical analysis of the new rounding error resilient recurrence relations. Numerical experiments that demonstrate both the parallel performance of the p()-CG method and the excellent attainable accuracy in comparison to earlier variants of pipelined Krylov subspace methods are presented in Section 5. The manuscript is concluded in Section 6.
For completeness we note that the numerical analysis in Section 4 focuses on analyzing the propagation of local rounding errors throughout the new p()-CG algorithm in detail, but does not include a standard forward or backward stability analysis with bounds on the local rounding errors.
2 Deep pipelined Conjugate Gradients
The deep pipelined Conjugate Gradient method, denoted p()-CG for short, was first presented in [11], where it was derived in analogy to the p()-GMRES method [18]. The parameter represents the pipeline length which indicates the number of iterations that are overlapped by each global reduction phase. We summarize the current state-of-the-art deep pipelined p()-CG method below, which forms the starting point for the discussion in this work.
2.1 Basis recurrence relations in exact arithmetic
Let be the orthonormal basis for the Krylov subspace in iteration of the p()-CG algorithm, consisting of vectors. Here is a symmetric positive definite matrix. The Krylov subspace basis vectors satisfy the Lanczos relation
[TABLE]
with
[TABLE]
Let , then the Lanczos relation (1) translates in vector notation to
[TABLE]
The auxiliary basis runs vectors ahead of the basis and is defined as
[TABLE]
where the matrix polynomial is given by
[TABLE]
with optional stabilizing shifts , see [18, 11, 27]. We refer to Section 2.2 for a discussion on the Krylov subspace basis , i.e. the choice of the polynomial . Contrary to the basis , the auxiliary basis is in general not orthonormal. It is constructed using the recursive definitions
[TABLE]
which are obtained by multiplying the Lanczos relation (3) on both sides by . Expression (6) translates into a Lanczos type matrix relation
[TABLE]
where the matrix contains the matrix , which is shifted places along the main diagonal. The bases and are connected through the basis transformation for , where is a banded upper triangular matrix with a band width of non-zero diagonals [11]. For a symmetric matrix the matrix is symmetric around its -th upper diagonal, since
[TABLE]
The following recurrence relation for is derived from the basis transformation (with ):
[TABLE]
A total of iterations after the dot-products
[TABLE]
have been initiated, the elements with , which were computed as , are corrected as follows:
[TABLE]
for , and:
[TABLE]
Additionally, in the -th iteration the tridiagonal matrix , see (1), can be updated recursively by adding one column. The diagonal element is characterized by the expressions:
[TABLE]
The term is considered zero when . The update for the off-diagonal element is given by
[TABLE]
The element has already been computed in the previous iteration and can thus simply be copied due to the symmetry of .
Once the basis has been constructed, the solution can be updated based on a search direction , following the classic derivation of D-Lanczos in [35], Sec. 6.7.1. The Ritz-Galerkin condition
[TABLE]
implies . The LU-factorization of the tridiagonal matrix is given by
[TABLE]
Note that . It follows from (16) that the elements of the lower/upper triangular matrices and are given by (with )
[TABLE]
Expression (2.1) indicates that the approximate solution equals
[TABLE]
where and . Note that . The columns (for ) of can be computed recursively. From it follows
[TABLE]
Denoting the vector by , it follows from that and for . Using the search direction and the scalar , the approximate solution is updated using the recurrence relation:
[TABLE]
The above expressions are combined in Alg. 1. Once the initial pipeline for has been filled, the relations (6)-(9) are used to recursively compute the basis vectors and in iterations (see lines 15-16). The scalar results of the global reduction phase (line 18) are required iterations later (line 5-6). In every iteration global communication is thus overlapped with the computational work of subsequent iterations, forming the heart of the communication hiding p()-CG algorithm.
Remark 1**.**
Residual norm in p()-CG. Note that the residual is not computed in Alg. 1, but its norm is characterized by the quantity for . This quantity can be used to formulate a stopping criterion for the p()-CG iteration, see Alg. 1 line 27.
Remark 2**.**
Dot products in p()-CG. Although Alg. 1, line 18 indicates that in each iteration a total of dot products need to be computed, the number of dot product computations can be limited to by exploiting the symmetry of the matrix , see expression (2.1). Since for , only the dot products for and the -th upper diagonal element need to be computed in iteration .
2.2 *On the conditioning of the auxiliary basis *
As is an orthonormal basis, the transformation can be interpreted as a QR factorization of the auxiliary basis . Moreover, it holds that is the Cholesky factorization of , since
[TABLE]
The elements of the transformation matrix are computed on lines 5-6 of Alg. 1 precisely by means of this Cholesky factorization. This observation leads to the following two essential insights related to the conditioning of the basis .
Remark 3**.**
Square root breakdowns in p()-CG. The auxiliary basis vectors are defined as , but the basis is in general not orthogonal. Hence, vectors are not necessarily linearly independent. In particular for longer , different vectors are expected to become more and more aligned. This leads to being ill-conditioned, approaching singularity as increases. The effect is the most pronounced when , in which case . Shifts can be set to improve the conditioning of , see also Remark 4.
When for certain the matrix becomes (numerically) singular, the Cholesky factorization procedure in p()-CG will fail. This may manifest in the form of a square root breakdown on line 7 in Alg. 1 when the root argument becomes negative. Numerical round-off errors may increase the occurrence of these breakdowns in practice. When a breakdown occurs in p()-CG the iteration is restarted, in analogy to the GMRES algorithm, although it should be noted that the nature of the breakdown in both algorithms is quite different. Evidently, the restarting strategy may delay convergence compared to standard CG, where no square root breakdowns occur.
Remark 4**.**
*Choice of the auxiliary basis and relation to the shifts. *** It follows from the Cholesky factorization (21) that the inverse of the transformation matrix is
[TABLE]
The conditioning of the matrix is thus determined by the conditioning of . Furthermore, it holds that
[TABLE]
where is a part of the basis obtained by dropping the first columns. Hence, the polynomial has a major impact on the conditioning of the matrix , which in turn plays a crucial role in the propagation of local rounding errors in the p()-CG algorithm **[11]**, see Section 3.1 of the current work. This observation indicates intuitively why should preferably be as small as possible, which can be achieved by choosing appropriate values for the shifts . Optimal shift values in the sense of minimizing the Euclidean -norm of are the Chebyshev shifts **[27, 18, 11]** (for ):
[TABLE]
*which are used throughout this work. This requires a notion of the largest (smallest) eigenvalue (resp. ), which can be estimated a priori, e.g. by a few power method iterations. *
2.3 Loss of orthogonality and attainable accuracy
This work is motivated by the observation that two main issues affect the convergence of pipelined CG methods: loss of basis vector orthogonality and inexact Lanczos relations. We comment briefly on both issues from a high-level point of view and clearly mark the scope of this work. Important insights about the similarities and differences between classic CG and p()-CG are highlighted before going into more details on the numerics in Sections 3 and 4.
2.3.1 Loss of orthogonality
It is well-known that in finite precision arithmetic the orthogonality of the Krylov subspace basis , i.e. (identity matrix) may not hold exactly. Inexact orthogonality may appear in every variant of the CG algorithm, see [29], in particular in the D-Lanczos222Note: The D-Lanczos (short for “direct Lanczos”) algorithm is a variant of the CG method that is equivalent to the latter in exact arithmetic, save for the solution of the system which is computed by using Gaussian elimination in D-Lanczos. The D-Lanczos method is the basic Krylov subspace method from which the p()-CG method was derived, see [11], Section 2, for details. algorithm [35], where a new basis vector is constructed by orthogonalizing with respect to the previous two basis vectors, as well as in the related p()-CG method, Alg. 1. Loss of orthogonality typically leads to delay of convergence, meaning the residual deviates from the one in the scenario in which orthogonality would not be lost.
We use a notation with bars to designate variables that are actually computed in a finite precision implementation of the algorithm. The key relation for the Conjugate Gradient method is the Ritz-Galerkin condition:
[TABLE]
This equality only holds under the assumption that which requires . Note that in finite precision arithmetic the convergence delay can be observed in both the actual residual norm as well as the recursively computed residual norm , since both quantities are based on the (possibly non-orthogonal) basis , see Fig. 1(a)-2(a) (discussed further in Section 2.3.3).
2.3.2 The inexact Lanczos relation
The basis vectors in the pipelined CG algorithm are not computed explicitly using the Lanczos relation (1). Rather, they are computed recursively, see (9), to avoid the computation of additional spmvs. In finite precision, local rounding errors in the recurrence relation may contaminate the basis , such that the Lanczos relation is no longer valid. Moreover, due to propagation of local rounding errors, may grow dramatically as the iteration proceeds. Using the classic model for floating point arithmetic with machine precision [33, 22, 21, 45], the round-off error on basic computations on the matrix , vectors , and a scalar are bounded by
[TABLE]
Here indicates the finite precision floating point representation, is the maximum number of nonzeros in any row of , and the norm represents the Euclidean 2-norm. Under this model the recurrence relations for and in a finite precision implementation of p()-CG are
[TABLE]
with expression (27) translating in matrix notation to
[TABLE]
Recall that in exact arithmetic . In these expressions and are local rounding errors which are bounded by and , and . The actual residual satisfies the following relations in a finite precision setting:
[TABLE]
where . The recursively computed residual that appears in expression (2.3.2) tends to zero. The actual residual norm , on the other hand, stagnates around , a quantity referred to as the maximal attainable accuracy of the method. The difference between the norm of the actual residual and the recursively computed residual is illustrated in Fig. 1(a). A detailed analysis of the deviation from the Lanczos relation in finite precision (“inexact Lanczos”) can be found in Section 3.
2.3.3 Scope and limitations of this manuscript
The issue of inexact Lanczos relations in p()-CG is timely and deserving of attention. Loss of accuracy resulting from the inexact Lanczos relation has long been a limiting factor in applying p()-CG and related algorithms in practice. Fig. 1(a) illustrates how the norms of the actual residuals stagnate while the recursively computed residual norms continue to decrease. For p()-CG local rounding errors in the recurrence relations are propagated leading to reduced attainable accuracy compared to D-Lanczos. Moreover, while loss of orthogonality also warrants further investigation, this issue is not exclusive to pipelined methods. Delayed convergence is observed in classic Krylov subspace methods also, see Fig. 1, whereas loss of attainable accuracy is not. The issue could be addressed by e.g. re-orthogonalizing the basis, see [42]. However, communication-reducing methods are not particularly suitable to include re-orthogonalization, since this introduces additional global reduction phases.
Although loss of orthogonality does not originate from applying the pipelining technique, it may be more pronounced for pipelined methods compared to their classic counterparts, see Fig. 1(b). However, Fig. 1(a) indicates that the effect of the inexact Lanczos relation on convergence is much more dramatic than the effect of inexact orthogonality for all pipeline lengths . This manuscript thus focuses on improving the numerical stability of the p()-CG method by neutralizing the propagation of local rounding errors in the recursively computed basis vector updates. As such, this work proposes a key step towards a numerically stable communication hiding variant of the CG method.
3 Analyzing rounding error propagation
This section recaps the analysis of local rounding errors that stem from the recurrence relations in the pipelined p()-CG method, Alg. 1. It aims to precisely explain the source of the loss of accuracy observed for the p()-CG method. The methodology for the analysis is similar to the one used in classic works by Paige [33, 34], Greenbaum [20, 22, 21], Gutknecht [24], Strakos [41], Meurant [32], Sleijpen [38, 39], Van der Vorst [44], Higham [26], and others.
Finite precision variants of the exact scalar and vector variables introduced in Section 2 are denoted by a bar symbol in this section. We differentiate between “actual” vector variables, which satisfy the Lanczos relations exactly but are not computed in the algorithm, and “recursively computed” variables, which contain machine-precision sized round-off errors related to finite precision computations.
3.1 *Local rounding error behavior in finite precision *
For any the true basis vector, denoted by , satisfies the Lanczos relation (3) exactly, that is, for :
[TABLE]
without the addition of a local rounding error. This vector is not actually computed in the p()-CG algorithm. Instead, the computed basis vector is calculated from the finite precision variant of relation (9) for , i.e.:
[TABLE]
where the size of the local rounding errors is bounded in terms of machine precision: . Let and . Relation (30) alternatively translates to the following formulation in matrix notation (with ):
[TABLE]
where is a -by- rectangular matrix holding the entries directly below the main diagonal. The matrix collects the gaps between the actual and recursively computed basis vectors, which quantify the deviation from the Lanczos relation in the finite precision setting. These gaps are crucial in describing the propagation of local rounding errors throughout the p()-CG algorithm and are directly linked to the gap between the actual and recursively computed residuals, see expression (2.3.2). From (31) one obtains
[TABLE]
where collects the local rounding errors. The computed auxiliary vector satisfies a finite precision version of the recurrence relation (6), which summarizes to
[TABLE]
where is the local rounding error which can again be bounded in terms of machine precision . Expression (34) can be formulated in matrix notation as:
[TABLE]
Furthermore, the following matrix relations hold between the scalar coefficients and in Alg. 1:
[TABLE]
Subsequently, using expressions (32), (33), (35) and (36) and it is derived that the gaps on the basis vectors are given by
[TABLE]
where should be interpreted as a Moore-Penrose (left) pseudo-inverse. Hence, the local rounding errors in this expression are possibly amplified by the entries of the matrix , which may lead to loss of attainable accuracy for the p()-CG method. The inexact Lanczos relation may in turn give rise to a growing gap between the computed and actual residual on the solution, see (2.3.2). It is clear from expression (37) that the conditioning of the matrix plays a crucial role in the rounding error propagation in the p()-CG algorithm as indicated in Section 2.2, see Remark 4.
3.2 *Toward stability by using the Lanczos relation *
Section 3.1 shows that the recurrence relation (31) is the main cause for the amplification of local rounding errors throughout the p()-CG algorithm. Moreover, the possibly ill-conditioned matrix that is used construct the basis , see expression (33), may be detrimental for convergence. A straightforward countermeasure would be to eliminate in the construction of the basis. This can be achieved by simply replacing the recurrence relation (31) by the original Lanczos relation, i.e., for :
[TABLE]
Here represents a local rounding error which is generally different from the error occurring in expression (31). Recurrence relation (38) can alternatively be written as
[TABLE]
with . The gap between the true basis vector and the computed basis vector then reduces to
[TABLE]
By using recurrence (38) for instead of relation (31) in Alg. 1, no amplification of local rounding errors occurs, see (40), and the influence of rounding errors on attainable accuracy remains limited. However, to use the recurrence relation (38) an additional spmv, i.e. , is computed in each iteration of the algorithm, leading to an undesirable increase in computational cost. Although the use of expression (38) would not hinder the ability to overlap the global reduction phase with computations (for pipeline length the global reduction would simply be overlapped with spmvs), we aim to avoid adding spmv computations to the algorithm.
The technique proposed by expression (38) shows similarity to the concept of residual replacement, which was suggested by several authors as a countermeasure to local rounding error propagation in various multi-term recurrence variants of CG [44, 3, 10]. While the idea is valuable, it cannot be implemented in the p()-CG method in its current form, i.e. using expression (38), due to the significantly augmented computational cost caused by the additional spmv in each iteration.
4 Deriving stable recurrence relations
We now present the core technique for obtaining a numerically stable variant of the recurrence relations used in the p()-CG algorithm by introducing additional auxiliary bases and corresponding recurrence relations. Sections 4.1-4.3 are again written in the exact arithmetic framework in order to derive the algorithm, interluded by a short discussion on computational costs and storage requirements in Section 4.2. We return to the finite precision framework for the analysis of the improved method in Section 4.4.
4.1 Derivation of a stable pipelined CG method
We introduce a total of bases, denoted by , where the upper index ‘’ (with ) labels the different bases and the lower index ‘’ indicates the iteration like before. The basis will denote the original Krylov subspace basis, that is: . The auxiliary basis vectors are defined identically to the basis in p()-CG, cf. (4), i.e. . In addition, we also define intermediary bases that will enable us to use a variant of the Lanczos relation (38) to recursively update , but without the necessity to compute the spmv . The auxiliary bases are defined as follows:
[TABLE]
where the polynomial is defined by (5). Note that the first vectors in basis are identical to the first vectors in all bases with . By definition (41) the ‘zero-th’ basis is simply the original basis , whereas the -th basis is the auxiliary basis from the p()-CG method, see Section 2.
A crucial relation connects each pair of bases and (for ). It holds that
[TABLE]
which translates into
[TABLE]
By multiplying the original Lanczos relation (3) for on both sides by the respective polynomial with and by exploiting the associativity of and , it is straightforward to derive that each auxiliary basis with satisfies a Lanczos type recurrence relation:
[TABLE]
for and . Note that when expression (44) yields the Lanczos relation (3) for , whereas setting results in the recurrence relation (6) for .
The recursive expressions (44) for the bases are not particularly useful in practice, since each recurrence relation requires to compute an additional spmv to form the next basis vector. However, using relation (43) the recurrence relations (44) can alternatively be written as:
[TABLE]
with and . We stress that only for , i.e. to compute the vectors in the auxiliary basis , we use the recursive update given by expression (44):
[TABLE]
for , which reduces to the recurrence relation (6). The recurrence relations (45) allow us to compute the vector updates for the bases without the need to compute any additional spmv. Adding the recurrence relations (45) for the auxiliary bases to the p()-CG method leads to the stable p()-CG method, Alg. 2.
Let us expound on Alg. 2 in some more detail. In the -th iteration of Alg. 2 each basis is updated by adding one vector. Thus the algorithm computes a total of new basis vectors, i.e.: , per iteration. For each basis, the corresponding vector update in iteration is computed as follows:
[TABLE]
Note that all vector updates make use of the same scalar coefficients , and that are computed in iteration of Alg. 2 (lines 11-17). We also remark that only one spmv, namely , is required per iteration to compute all basis vector updates, similar to Alg. 1.
The merit of the intermediate basis vector updates is that they allow to replace the recurrence (9) for the original basis vector by relation (44), which for yields:
[TABLE]
with . Since relation (43) with states that , expression (47) very closely resembles the finite precision Lanczos recurrence relation (38). In particular, we point out that the matrix does not occur in the recursive update (47) for . We clarify the difference between the finite precision and exact variant of recurrence relation (47) in Section 4.4.
Example 1**.**
To illustrate the methodology for constructing the basis in the stable p()-CG method, consider the case where the pipeline length . We formulate the improved p()-CG method following the derivation above. This scenario features the default Krylov subspace basis and three auxiliary bases: and . To compute a new vector for each auxiliary basis, the following recurrence relations are used by Alg. 2 in iteration :
[TABLE]
The final recurrence relation to update is identical to expression (6) with . The former recurrence relation for , expression (9), is replaced by the (stable) relation (47). This update explicitly uses the auxiliary variable and implicitly depends on the other auxiliary variables and through the respective recurrence relations. All four recurrence relations above make use of the same scalar coefficients and that form the last column of the matrix . Similar to Alg. 1 these coefficients are computed on line 11-17 in Alg. 2, right before the recursive vector updates.
Example 2**.**
In the case where the pipeline length is one, i.e. , the stable p()-CG Alg. 2 formally differs slightly from the original formulation of p()-CG, Alg. 1. The improved algorithm uses the following recurrence relations for and in iteration :
[TABLE]
where and . This implies that the only difference between the improved and original p()-CG method is the recurrence relation for . The recurrence relation for above is equivalent to the recurrence relation (9) in exact arithmetic, but it is numerically (more) stable as explained in Section 4.4.
4.2 Computational costs and storage requirements
We give an overview of implementation details of the stable p()-CG method, Alg. 2, including global storage requirements and number of flops (floating point operations) per iteration. We compare to the same properties for the former version of p()-CG Alg. 1 [11] and Ghysels’ p-CG method [19]. The latter algorithm, although mathematically equivalent to p()-CG, was derived in an essentially different way.
4.2.1 Floating point operations per iteration
All Conjugate Gradient variants listed in Table I compute a single spmv in each iteration. However, as indicated by the Time column, time per iteration may be reduced significantly by overlapping the global reduction phase with the computation of one or multiple spmvs. Time required by the local axpy and dot-pr computations is neglected, since these operations are assumed to scale perfectly as a function of the number of workers.
Comparing Alg. 1 to Alg. 2, it is clear that the latter requires an additional axpys per iteration to update the auxiliary vectors using the recurrence relations (45). However, the recurrence relation to update , expression (47), only requires axpy operations, instead of the axpys required to update in Alg. 1 using expression (9). Both algorithms furthermore compute local dot products (see Remark 2) to form the matrix and use two additional axpy operations to update the search direction and the iterate . In summary, as indicated by the Flops column in Table I, both p()-CG algorithms use a total of flops in each iteration. The recurrence relations in the p()-CG algorithm can thus be stabilized at no additional computational cost using the framework outlined in Section 4.1.
4.2.2 Global storage requirements
Section 4.4 proves that the stable p()-CG method, Alg. 2, is extremely resilient to the presence of local rounding errors in the recurrence relations. However, this stability comes at a slightly increased storage cost compared to Alg. 1. The latter requires to store vectors of the basis (required for vector updates), vectors of the auxiliary basis (for vector updates and dot product computations, see also Remark 2), and the vector at any time during the execution of the algorithm from iteration onward. In contrast, Alg. 2 stores the three most recently updated vectors in each of the bases (which include the bases and ). In addition, vectors in the basis need to be stored for dot product computations. Table I summarizes the storage requirements for different variants of the CG algorithm. Alg. 1 keeps a total of vectors in memory at any time during execution, whereas Alg. 2 stores vectors. The memory overhead for the stable p()-CG method thus amounts to a modest vectors.
4.3 A stable preconditioned p()-CG algorithm
Preconditioning is indispensable to efficiently solve linear systems in practice. We briefly comment on the straightforward extension of Alg. 2 to include a preconditioner. This section follows the standard methodology that was described for pipelined CG in [19] and for Alg. 1 in [11].
Let the preconditioner be given by the matrix . We aim to solve the left-preconditioned linear system , where and are both symmetric positive definite matrices. This assumption does not necessarily imply that is symmetric. Nonetheless, symmetry can be preserved by observing that is self-adjoint with respect to the inner product . The basic strategy is thus to replace all Euclidean dot products occurring in Alg. 2 with inner products.
We cannot simply use the matrix to calculate these inner products, since the preconditioner inverse is in general not available. However, by introducing the unpreconditioned auxiliary variables and observing that these variables again satisfy a Lanczos type relation:
[TABLE]
this obstacle is circumvented. Using these unpreconditioned auxiliary variables , the dot products for can be computed as follows. For it holds that
[TABLE]
with . For we find
[TABLE]
This allows to formulate a preconditioned version of Alg. 2 by adding the recurrence relation (48) for the unpreconditioned auxiliary variables , and replacing the dot products on line and by expressions (49) and (50) respectively. From an implementation point of view the extension to the preconditioned algorithm only requires the application of the preconditioner , two additional axpy operations and storage of three additional vectors in memory.
4.4 Rounding error analysis for the improved method
We now consider the finite precision equivalents of the above recurrence relations for all basis vectors in the improved version of p()-CG, Alg. 2. The actually computed finite precision variants of the basis vectors and scalar variables are again denoted by bars.
Let the true basis vector satisfy the actual Lanczos relation without local roundoff (for )
[TABLE]
and let, analogously, the true auxiliary basis vectors be defined (for and ) as
[TABLE]
On the other hand, the finite precision variants of the recurrence relations (45), which are actually computed in Alg. 2, are for and given by the expressions
[TABLE]
whereas for the following finite precision relation holds (with ):
[TABLE]
The local rounding errors in the above expressions can be bounded by machine precision multiplied by the norms of the respective vectors, i.e. .
Switching to matrix forms to simplify the notation, we write the Lanczos relations (52) as
[TABLE]
for and . Note that this expression neglects the “initial” gaps, i.e. the quantities for , consisting of the local rounding errors from the application of the matrix polynomials to the initial vector , which are computed explicitly in the stable p()-CG method, Alg. 2 (line 3). The recurrence relations (53) are summarized by the matrix expressions
[TABLE]
with and , where and . The recurrence relation (54) for can be written as
[TABLE]
with . To find an expression for the gap on the basis vectors in , i.e. the gap , we now progressively compute the gaps on the auxiliary bases. Starting from , we compute , in which we substitute the gap on , followed by , etc., until we eventually obtain an expression for the gap . For we derive from (55) and (57) that
[TABLE]
As indicated by this expression, the gaps on the basis vectors in the basis thus consist of local rounding errors only. The following auxiliary lemma can easily be proven by induction.
Lemma 1**.**
Let be defined by (56) and be defined by (55). Then it holds for and that
[TABLE]
where with a local rounding error that is bounded by and .
Next, we combine expressions (55) and (56) and Lemma 1 for the case . We obtain (with )
[TABLE]
Hence, the gaps on the basis vectors are coupled to the gaps on the basis vectors . After substitution of expression (58) it is clear that consists only of local rounding errors. The above relation can be generalized for any as follows:
[TABLE]
where . After subsequent substitution of this expression starting from up until the characterization of , see (58), it appears that the gap on the original Krylov subspace basis is just a sum of local rounding errors. No rounding error propagation takes place in the stable p()-CG method, Alg. 2, on any of the (auxiliary) basis vectors, see expressions (58)-(4.4). By introducing the intermediate auxiliary bases for the recursive computation of the basis , the dependency of the basis vector gaps on the possibly ill-conditioned matrix , cf. expressions (33) and (37), is thus removed, resulting in a numerically stable algorithm.
5 Numerical results
We present various numerical experiments to benchmark the stable p()-CG method proposed in Section 4. The benchmark problems exemplify both the performance of the improved p()-CG method on large scale parallel hardware as well as its error resilience compared to other CG variants. Performance measurements result from a PETSc [1] implementation of the p()-CG algorithm on a distributed memory machine using the message passing paradigm (MPI).
5.1 Hardware and software specifications
Parallel performance experiments are performed on up to 128 compute nodes of a cluster consisting of two 14-core Intel E5-2680v4 Broadwell generation CPUs each (28 cores per node). Nodes are connected through an EDR InfiniBand network. We use PETSc version 3.8.3 [1]. The MPI library used for this experiment is Intel MPI 2018v3. The PETSc environment variables MPICH_ASYNC_PROGRESS=1 and MPICH_MAX_THREAD_SAFETY=multiple ensure optimal parallelism by allowing for asynchronous non-blocking global communication. Timing results reported in this manuscript are the most favorable results (smallest overall run-time) over 3 individual runs of each method. Experiments also show results for Ghysels’ p-CG method [19] as a reference. The p-CG method is similar to p()-CG in operational cost (see Table I), but features a significant loss of attainable accuracy due to rounding error propagation in its recurrence relations [10], similar to p()-CG, Alg. 1.
5.2 Benchmark (B1): 2D Laplace PDE
Fig. 2 is the analogue of the experiment reported in Fig. 1 where the p()-CG Alg. 1 is replaced by the improved Alg. 2. The model problem solved is the 2D Laplace equation
[TABLE]
with homogeneous Dirichlet boundary conditions, discretized using second order finite differences on a uniformly spaced grid. In contrast to Fig. 1(a), no loss of accuracy due to local rounding error amplification in the recurrence relations is observed in Fig. 2(a). The stabilized recurrences (45)-(46) ensure that the quantity is of order machine precision. The true residual norms (full lines) and recursively computed residual norms (dotted lines) coincide up to e- (maximal attainable accuracy) for all methods in Fig. 2(a). Fig. 2(b) quantifies the inexact orthogonality for the different methods which is comparable to Fig. 1(b). Hence, similarly to Fig. 1(a) a delay of convergence may be observed in Fig. 2(a).
Fig. 3 shows a performance experiment on the hardware and software setup specified above. A linear system resulting from discretization of the 2D Laplace equation (61) with exact solution , right-hand side and initial guess is solved. This problem is available as example in the PETSc Krylov subspace solvers (KSP) folder. The simulation domain is discretized using a uniform finite difference mesh ( unknowns). No preconditioner is applied. For p()-CG Chebyshev shifts are used based on the interval (known analytically), see (24).
Fig. 3(a) shows the speedup achieved by different CG methods over single-node classic CG for various pipeline lengths and node setups. Classic CG scales poorly for this model problem; no speedup is achieved beyond nodes. The pipelined methods scale well. The length one p-CG and p()-CG method achieve a relative speedup of approximately compared to classic CG when both are executed on nodes. The longer pipelined p()-CG and p()-CG methods out-scale the latter method, with p()-CG achieving a speedup relative to classic CG on nodes. When the global communication phase in each iteration is only partially ‘hidden’ behind the spmv computation, whereas overlapping with more than two spmvs by using pipelines length does not seem to improve performance further. Pipeline length is optimal for this problem, striking a good balance between overlapping communication and introducing additional auxiliary vectors.
Fig. 3(b) plots the relative residual norms as a function of the total time spent (in s.) by various CG algorithms (on 500 iteration intervals). The p-CG method stagnates around e- and is unable to attain a better accuracy regardless of computational effort. The stable p()-CG methods all are able to attain a much higher accuracy, stagnating around e- for . Maximal attainable accuracy is reached in 2.01 s. for p()-CG, 1.28 s. for p()-CG, 1.34 s. for p()-CG, 1.74 s. for p()-CG, 2.73 s. for p()-CG. Note that classic CG attains e- in s. (outside the graph). These results are consistent with Fig. 3(a), indicating that a pipeline length of suffices to hide the global communication phase for this problem.
5.3 Benchmark (B2): 3D Hydrostatic Ice Sheet Flow
Fig. 4(a) shows the result of two strong scaling experiments for the 3D Hydrostatic Ice Sheet Flow problem, see [2] for a full problem description. An implementation of this problem is available as example is the PETSc Scalable Nonlinear Equations Solvers (SNES) folder. The Blatter/Pattyn equations are discretized using (Fig. 4(a)) and (Fig. 4(b)) finite elements respectively. Hard- and software specifications are presented in Section 5.1. A Newton-Krylov outer-inner iteration is used to solve the non-linear problem. The CG methods used as inner solver are combined with a block Jacobi preconditioner333Note: To attain good parallel scaling using pipelined Krylov methods it is generally beneficial to choose a preconditioner which does not require global communication. We hence use (block) stationary iterative methods as preconditioners that only use spmvs and axpys but avoid computing dot products in both examples (B2) and (B3). (one block Jacobi step per CG iteration; one block per processor). The relative tolerance of the inner solvers is set to e- (Fig. 4(a)) and e- (Fig. 4(b)), while the outer relative tolerance is chosen to be e-. Seven Newton iterations are needed to reach the outer tolerance. Chebyshev shifts are based on the interval , which is chosen in an ad hoc fashion for this problem based on the presumed clustering of the spectrum around .
In Fig. 4(b) the scalability of p()-CG for is comparable to that of the p-CG method. On nodes a speedup of approximately is measured for the pipelined methods compared to classic CG on nodes. There is no gain in using longer pipeline lengths for this problem and hardware setup, indicating that the amount of computational work of a single iteration (spmv + prec) suffices to hide the communication in the global reduction in each iteration. It is expected that longer pipeline lengths out-scale the methods with shorter pipelines on very large numbers of nodes, since communication costs would increase accordingly. This is illustrated to some extend by the smaller problem reported in Fig. 4(a), which shows that for heavily communication bound problems the use of longer pipelines is beneficial for this benchmark. Fig. 4(a) indicates that depending on the amount of hardware parallelism (number of nodes) p()-CG (4-24 nodes), p()-CG (32-40 nodes) or p()-CG (48+ nodes) respectively display the biggest speedup.
An accuracy experiment for the FE benchmark problem is shown in Fig. 4(c). The experiment is run on nodes and the relative tolerance of the inner solver is e-. A relatively small delay of convergence is observed for the p()-CG method, with the effect worsening for longer pipelines. The convergence delay is due to restarts (caused by a square root breakdown) which come forth from the ill-conditioned auxiliary basis . The effect is negligible since the maximal total delay after seven Newton iterations is iterations (comparing p()-CG to CG), relative to a total of Krylov iterations. The p-CG method fails to reach the inner tolerance e-. As shown by the analysis in Section 4.4, no rounding error propagation occurs in the stable p()-CG algorithm and the method is able to attain a highly accurate solution.
5.4 Benchmark (B3): 2D Bratu Solid Fuel Ignition
The 2D Bratu Solid Fuel Ignition problem results from a finite difference discretization of the nonlinear PDE
[TABLE]
with homogeneous Dirichlet boundary conditions on a uniformly spaced 2D grid. Its implementation can be found in the PETSc SNES folder as example 5. The two-dimensional domain is discretized with uniformly spaced grid points. The Bratu parameter is set to , implying significant non-linearity. Different CG methods are used as the inner solver in the Newton-Krylov scheme. The preconditioner is a SOR-preconditioned Chebyshev iteration (three SOR steps per Chebyshev iteration; one Chebyshev step per CG iteration). Outer solver relative tolerance is set to e-, which is attained by all methods after three Newton steps. For p()-CG Chebyshev shifts based on the interval are used. A relative tolerance of e- is imposed on the inner solvers.
Fig. 5(a) presents the relative speedup of the p()-CG methods for compared to classic CG on one node. Timing data for the p-CG method is not included in the figure since it is unable to achieve the desired accuracy; the maximal accuracy attainable by p-CG is e-/e-/e- for outer SNES iteration respectively. On nodes the speedup achieved by the pipelined methods over classic CG is approximately . For this small sized problem the p()-CG method does not scale beyond 32 nodes since CPU times are dominated by the #nodes behavior of the global reduction phases in this regime. Longer pipelines do not yield a higher speedup compared to pipeline length due to the cost of the preconditioner, which requires two spmvs each iteration.
Fig. 5(b) shows the relative residual norm as a function of the total time spent for solving the three Newton iterations on a node setup. The timings and residual norms are taken from the same experiment as Fig. 5(a). The total time-to-solution is divided by the respective number of iterations for the different CG methods to accurately compare them for accurate comparison on a single graph. The p()-CG methods outperform the classic CG method despite a small increase in total number of iterations for longer pipeline lengths. The stable p()-CG algorithm reaches the relative tolerance e- for any choice of the parameter .
6 Conclusions
This work presents a redesigned algorithmic variant of the -length pipelined Conjugate Gradient method, p()-CG for short. The main improvement over former pipelined CG variants is the significantly improved maximal attainable accuracy that is attained by the new algorithm. More specifically, it is shown analytically and verified experimentally that the stable p()-CG method attains the same precision on the solution that is attainable by the classic CG method. By introducing intermediate auxiliary bases the propagation of local rounding errors in the recurrence relations is eliminated, allowing for high-precision solution independently of the choice of the pipeline length . The new p()-CG algorithm is elegant in the sense that it has no additional computational overhead and only minor additional storage requirements compared to previous versions of the p()-CG algorithm. The increased stability thus comes without the cost of increased complexity. The improved algorithm effectively replaces former (less stable) pipelined CG variants. As such, this work resolves one of the major numerical issues that has restricted the practical usability of pipelined Krylov subspace methods since their initial development.
Generalizing the stabilization technique proposed in this work to other pipelined methods is non-trivial. A similar methodology could be applied to the existing pipelined GMRES method, p()-GMRES [18]. However, practical restrictions limit the viability of this approach. Notably, the full storage of an additional auxiliary bases would be required, which can be assumed to be unfeasible for large-scale applications. An -length variant of the pipelined BiCGStab method [9] is currently not available, but the proposed technique could be promising for direct application in the context of Bi-Conjugate Gradient methods.
The research presented in this manuscript provides vital advancements towards establishing a numerically stable communication hiding variant of the Conjugate Gradient method. However, it is well-known that Krylov subspace methods may also suffer from delay of convergence due to loss of basis orthogonality. Our experiments indicate that this effect is typically enlarged as a function of pipeline length. Analyzing the stability issues related to loss of orthogonality deserves to be treated as part of future work.
Acknowledgments
The authors are grateful for the funding received that supported this work. In particular, J. C. acknowledges funding by the University of Antwerp Research Council under the University Research Fund (BOF) and S. C. is funded by the Flemish Research Foundation (FWO Flanders) under grant 12H4617N. We also cordially thank Pieter Ghysels for providing useful comments on an earlier version of this manuscript.
The reference list from the paper itself. Each links out to its DOI / PubMed record.
- 1[1] S. Balay, S. Abhyankar, M.F. Adams, J. Brown, P. Brune, K. Buschelman, L. Dalcin, V. Eijkhout, W.D. Gropp, D. Kaushik, M.G. Knepley, L. Curfman Mc Innes, K. Rupp, B.F. Smith, S. Zampini, and H. Zhang. PET Sc Web page. www.mcs.anl.gov/petsc , 2015.
- 2[2] J. Brown, B. Smith, and A. Ahmadia. Achieving textbook multigrid efficiency for hydrostatic ice sheet flow. SIAM Journal on Scientific Computing , 35(2):B 359–B 375, 2013.
- 3[3] E. Carson and J. Demmel. A residual replacement strategy for improving the maximum attainable accuracy of s-step Krylov subspace methods. SIAM Journal on Matrix Analysis and Applications , 35(1):22–43, 2014.
- 4[4] E. Carson, N. Knight, and J. Demmel. Avoiding communication in nonsymmetric Lanczos-based Krylov subspace methods. SIAM Journal on Scientific Computing , 35(5):S 42–S 61, 2013.
- 5[5] E. Carson, M. Rozlozník, Z. Strakoš, P. Tichỳ, and M. Tma. The numerical stability analysis of pipelined Conjugate Gradient methods: Historical context and methodology. SIAM Journal on Scientific Computing , 40(5):A 3549–A 3580, 2018.
- 6[6] A.T. Chronopoulos and C.W. Gear. s-Step iterative methods for symmetric linear systems. Journal of Computational and Applied Mathematics , 25(2):153–168, 1989.
- 7[7] A.T. Chronopoulos and A.B. Kucherov. Block s-step Krylov iterative methods. Numerical Linear Algebra with Applications , 17(1):3–15, 2010.
- 8[8] S. Cools. Analyzing and improving maximal attainable accuracy in the communication hiding pipelined Bi CG Stab method. Parallel Computing, PMAA’18 Special Issue, accepted for publication, in press , 2019.
