Linear systems solvers - recent developments and implications for lattice computations
Andreas Frommer (Department of Mathematics, University of Wuppertal,, Germany)

TL;DR
This paper reviews recent advances in Krylov subspace methods for solving non-Hermitian linear systems, emphasizing their near-optimal performance for lattice gauge theory computations and highlighting preconditioning as a key area for future improvements.
Contribution
It analyzes the effectiveness of mature Krylov methods like QMR, BiCGStab, and GMRES for Wilson fermion matrices, stressing the importance of preconditioning.
Findings
Krylov methods are near-optimal for Wilson fermion matrices
Preconditioning is crucial for further improvements
Mature methods like QMR, BiCGStab, GMRES are effective
Abstract
We review the numerical analysis' understanding of Krylov subspace methods for solving (non-hermitian) systems of equations and discuss its implications for lattice gauge theory computations using the example of the Wilson fermion matrix. Our thesis is that mature methods like QMR, BiCGStab or restarted GMRES are close to optimal for the Wilson fermion matrix. Consequently, preconditioning appears to be the crucial issue for further improvements.
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Linear systems solvers – recent developments and implications
for lattice computations
A. Frommer
Fachbereich Mathematik, Universität Wuppertal,
42097 Wuppertal, Germany
Abstract
We review the numerical analysis’ understanding of Krylov subspace methods for solving (non-hermitian) systems of equations and discuss its implications for lattice gauge theory computations using the example of the Wilson fermion matrix. Our thesis is that mature methods like QMR, BiCGStab or restarted GMRES are close to optimal for the Wilson fermion matrix. Consequently, preconditioning appears to be the crucial issue for further improvements.
1 KRYLOV SUBSPACE METHODS
Given a linear system of equations
[TABLE]
with being non-singular, the class of Krylov subspace iterative methods for solving (1) is characterized by the following generic template
[TABLE]
Here, is a polynomial of degree . For the residual we therefore get
[TABLE]
where is the polynomial . In an algorithmic description of virtually any Krylov subspace method, the polynomials or are not explicitly present, but they are crucial to a theoretical analysis of the method. Moreover, the relation (2) is also the key to understanding the condition (‘difficulty’) of the linear system to be solved and we start by discussing this point.
1.1 Condition
Assume that is diagonaizable, i.e. we have a decomposition of the form
[TABLE]
where is diagonal with its diagonal containing the eigenvalues, and is the corresponding matrix of (right) eigenvectors. Denoting the spectrum of by , from (2) we now get and therefore
[TABLE]
Since is diagonal we have
[TABLE]
Considering and as constants, the best bound any Krylov subspace method can achieve in (3) is the one obtained for the polynomial which minimizes . In this sense, the quantities
[TABLE]
represent a measure of the condition of the system (1), since no Krylov subspace method can achieve a better bound in (2) than the one which replaces by . Finding the optimal polynomial in (4) is a complex approximation problem for which solutions are known only in special cases. However, it is clear that due to the restriction the numbers will tend to zero only slowly if there are many eigenvalues close to 0, particularly if they are distributed quite evenly around [math].
1.2 Optimal methods
A Krylov subspace method is feasible algorithmically if it requires only a finite amount of ressources like storage and computer time. We express this fact by saying that the method can be implemented using short recurrencies, meaning that all quantities needed at iteration can be computed from those of a small number of previous iterations. Note that each Krylov subspace method will require at least one multiplication with per iteration to account for the fact that the degree of the polynomial will increase as the iteration proceeds. The following theorem [1] shows that optimality and short recurrencies can only be achieved for a restricted class of matrices.
Theorem 1.1
A Krylov subspace method which achieves optimality, i.e.
[TABLE]
for every initial residual and which can be implemented using short recurrencies exists only if is of the form
[TABLE]
This theorem also holds if is replaced by an energy norm of the form with hermitian and positive definite. For hermitian and positive definite the CG method achieves optimality in the energy norm with . For hermitian (but possibly indefinite), MINRES [2] is optimal in the Euclidian norm. The paper [3] gives algorithmic descriptions for optimal methods in the other cases of Theorem 1.1. Note that the above theorem includes matrices of the form with (take ), arising for staggered fermions.
2 NON-HERMITIAN SYSTEMS
The last 10 years have seen tremendous progress in Krylov subspace methods for solving linear systems which, like the Wilson fermion matrix, do not fall into the category covered by Theorem 1.1. See [4, 5] for an overview. For simplicity, such systems will just be termed ‘non-hermitian’ in the sequel. In these cases one must find an adequate compromise between the quality of the Krylov subspace method to use and the ressources required by the method.
The first method of this kind is the BiCG method [6]. Here, an additional shadow residual is selected and the -th iterate is defined by the Galerkin condition
[TABLE]
for all polynomials of degree . In case that is hermitian positive definite and the method reduces to the CG method. BiCG needs two matrix multiplies (one with and one with ) per iteration and the residuals typically undergo quite large variations. Moreover, there are situations where the method breaks down (due to division by zero) without having reached a solution. Although exact breakdowns do rarely occur in practice, near breakdowns severely affect the numerical stability.
2.1 QMR
QMR, the quasi minimal residual method of [7], can be regarded as one way to make BiCG more reliable. As BiCG it is based upon the non-symmetric Lanczos process to compute an appropriate basis of the Krylov subspace . The -th residual is characterized by the coefficient vector in having minimal norm subject to the condition . If the Lanczos vectors were orthogonal this would imply that is minimal. Since for non-hermitian matrices the Lanczos vectors are not orthogonal, minimizing the coefficient vector merely implies a ‘quasi’ minimality of whence the name QMR. QMR eliminates one source of breakdowns present in BiCG. Moreover, using a look-ahead strategy in the non-symmetric Lanczos process, almost all other (exact or near) breakdowns are also avoided at the price of extra storage. All these features are implemented in QMRPACK which is available from netlib. As in BiCG each iteration costs one multiply with and one with . The quite smooth convergence of QMR is also justified by the theoretical analysis.
2.2 -hermitian matrices
A matrix is said to be -hermitian if there exists a matrix such that
[TABLE]
In this particular case, the non-symmetric Lanczos process can be made less costly, since through the right choice of the ‘shadow residual’ all multiplications with can be replaced by multiplications with [8]. Consequently, BiCG and QMR require only one multiply with and one with in each iteration. For the Wilson fermion matrix we have and thus multiplies with are by far more cheaper than with . Exploiting the -symmetry thus makes QMR (and BiCG) competitive to the other methods discussed in this section, see [9, 10]. At the time of writing this article, including the -hermitian case into QMRPACK was under preparation [11] but not yet completed.
2.3 BiCGStab
The BiCGStab [12] method is another way to stabilize BiCG. Here, multiplies with are replaced by multiplies with such that an additional one-dimensional minimization process is performed during each iteration.
All computational effort, in particular, all matrix multiplies is spent working on the iterates of the system to solve. Typically, BiCGStab produces less varying residuals than BiCG, although the same sources for breakdowns are still present. BiCGStab is quite easy to implement ‘from scratch’. Some variations are described in [13, 14]
2.4 Restarted GMRES
In contrast to the Lanzcos process, the Arnoldi process computes an orthogonal basis of for a general non-hermitian matrix . From the Arnoldi basis it is possible to calculate an optimal iterate (such that satisfies (5) ) by solving a small least squares problem.
The resulting method is called GMRES, the generalized minimal residual method [15]. However, the Arnoldi process does not rely on short recurrencies requiring vectors of storage and inner products to be computed.
One therefore has to stop GMRES after a certain number (, say) of iterations and restart the process with the current iterate as a new initial guess. The resulting method is termed restarted GMRES or GMRES(). For , a restart is done after every iteration. Hence, GMRES(1) is identical to the familiar MR method [16], where the iterate is obtained by minimizing with respect to . There are situations where GMRES() stagnates without reaching a solution, even for large restart values , but if all eigenvalues of lie in the right half plane GMRES() is known to converge for all [5, 15].
3 PRECONDITIONING
We have seen in Section 1 that the eigenvalue distribution of determines a bound on the maximal speed of any Krylov subspace method for . Once we have a method which is close to optimal, the only way of getting further improvement is to change the matrix to one for which the eigenvalue distribution is more favorable. This is precisely the purpose of preconditioning where the original system is changed to
[TABLE]
with and . The matrices are called the left and right preconditioner, resp., and their product is often referred to as the preconditioner. Note that the spectrum of is identical to that of , so that the effect of preconditioning on the eigenvalue distribution is determined by alone but not by its factorization . A preconditioner should approximate (so that the eigenvalues of cluster around 1). On the other hand, performing a Krylov subspace method on the preconditioned system requires multiplies with the preconditioned matrix like in which are normally obtained via
[TABLE]
Preconditioning thus introduces additional solves with the matrices and and this overhead should not be too expensive in order to get an efficient method. A good preconditioner is always a compromise between the latter requirement and the fact that should well approximate .
Conceptually, one may distinguish between two types of preconditioners: In problem oriented preconditioners the matrix is taken as a simpler or reduced (with respect to ) representation of the underlying physical problem. Algebraic preconditioners are obtained directly from without recourse to the application from which arises. Interestingly, algebraic preconditioners seem to be more successful than problem oriented ones in QCD computations and we therefore focus on the latter ones.
3.1 SSOR preconditioners
Each matrix can be decomposed into
[TABLE]
where and represent the diagonal, the stricly lower and the stricly upper triangular part of . We assume that has all diagonal elements , so that and are all non-singular. For a given relaxation parameter the SSOR preconditioner is defined by (see [5], e.g.)
[TABLE]
For we thus have as an approximation to . Systems with the preconditioner are easy to solve because and are triangular so that in can be obtained by a simple forward recursion, and similarly by a backward recursion in . Note that the situation becomes more involved if we consider parallelization issues since recursions are known to parallelize badly.
Assume that is of the particular form
[TABLE]
This is the case for the Wilson fermion matrix if we use the standard odd-even ordering (with ). If we take and we get
[TABLE]
For the Wilson fermion matrix the second diagonal block in (7) is commonly called the odd-even reduced system. Our discussion shows that odd-even reduction is nothing else but the SSOR preconditioning with respect to the odd-even ordering and with . Very exceptionally, in this case it is of no harm to calculate the preconditioned matrix explicitly as done in (7), whereas in general this produces too much fill-in to be practicable. If we re-interprete as block parts of , the above discussion can also be used to derive block SSOR preconditioners. In QCD this can be useful in the context of improved actions where then is block diagonal with blocks of size . See also [17, 18]
3.2 ILU factorizations
The incomplete LU factorization (ILU) (see [5], e.g.) is another algebraic method to obtain a preconditioner for where, again, are diagonal, strictly lower and strictly upper triangular, respectively. These matrices are obtained by performing a variant of Gaussian elimination on imposing restrictions on the amount of fill-in in the factors and so that represents only an approximate (incomplete) factorization of . If we allow for no fill-in (i.e. and have the same sparsity structure as ) and if represents a nearest neighbor-coupling on a regular grid, then and , so that the only difference to the SSOR preconditioner resides in the diagonal part . For the Wilson fermion matrix with Wilson parameter both preconditioners turn out to be totally equal. ILU preconditioners are often somewhat more efficient than SSOR preconditioners, but note that they require a start-up phase to compute (and and , in general).
3.3 The Eisenstat trick
If we have an SSOR or ILU preconditioner of the form and , the product can be computed as
[TABLE]
As far as flop counts are concerned, the above scheme is as expensive as one multiplication with itself, except for some additional operations involving diagonal matrices which can usually be neglected. So, due to the Eisenstat trick, the ILU and SSOR preconditioners do not increase the amount of work per iteration, thus making these preconditioners particularly attractive. Note that the Eisenstat trick can also be applied in more general situations, see [19].
3.4 The influence of orderings
When writing down the equation we are free to chose any ordering for the variables, and the change from one ordering to another translates into a transformation of the kind with a permutation matrix. For both, the SSOR and ILU preconditioners, the spectrum of the preconditioned matrix depends on the ordering chosen (but the Eisenstat trick can always be applied). There is therefore a potential to optimize these preconditioners using the best ordering. Typically, orderings which yield good preconditioners make the recurrencies in solving the triangular systems less amenable to parallel implementations. For example, the natural lexicographic ordering of lattice points in the Wilson fermion matrix was shown to yield a high quality ILU preconditioner [20], but it cannot be handled efficiently on a distributed memory parallel computer. In [21] it was shown that a new locally lexicographic ordering can yield up to a factor 2 improvement over odd-even preconditioning on a Quadrics parallel computer.
3.5 Polynomial preconditioning
Another algebraic preconditioner is obtained by taking where is a polynomial such that approximates . So the multiplication with requires multiplies with . Consequently, steps on the original system are as expensive as one step on the preconditioned system and the iterates are from the same Krylov subspace. In this respect polynomial preconditioning therefore offers little advantage, but it was shown in [17] that it can be useful as a mean of stabilizing the MR method in certain situations.
4 EXAMPLE: WILSON FERMIONS
The generic form of the Wilson fermion matrix is where represents the nearest neighbor coupling on the space-time lattice. Taking the even-odd ordering, has the form
[TABLE]
The odd-even reduced matrix from (7) is
[TABLE]
A typical example of the eigenvalue distribution for and (calculated from a confined configuration on a small -lattice at and ) is given on top of Fig. 1. Note that all eigenvalues lie in the right half plane so that GMRES() is known to converge for all . A number is an eigenvalue of if and only if it is of the form where is an eigenvalue of . We write for the smallest real part of an eigenvalue of .
Both, and are -symmetric, and we denote the respective symmetrized systems by
[TABLE]
and are both hermitian and half of their eigenvalues are negative and half are positive. Moreover, the eigenvalue plots given in Fig. 1 show that except for a pair close to zero the eigenvalues are quite evenly distributed in two intervals symmetric to the origin, denoted for . Finally, if we consider , then its eigenvalues are just the squares of those of and are therefore distributed in the interval .
With the information of Fig. 1 as a background, we can now start to discuss the condition of the different matrices, i.e. the numbers from (4). First of all we realize that odd-even preconditioning really makes the spectrum of and ‘nicer’, since eigenvalues are mapped away from 0 and are more clustered. We thus restrict the subsequent discussion to the even-odd preconditioned matrices. For the hermitian matrices and , rather good bounds for can be derived via the Chebyshev polynomials on , and we obtain
[TABLE]
Since MINRES is a feasible optimal method for and CG is an optimal method for , we also can take the above numbers as an approximate measure for the performance of these methods. They indicate that CGNR, the CG method applied to the normal equations would require half as many iterations as MINRES for . So in terms of matrix multiplies with (or ) – which is the computationally dominating part –, both methods should be comparable. Fig. 2 gives some experimental data, where we show the convergence history of MINRES and CGNR plotting the residual norm against the number of matrix mulitplies. We see that MINRES actually performs somewhat better than CGNR. The data comes from a confined configuration on a lattice at and which yields a relative quark mass of 0.02, approx. (In order to observe substantial differences between different methods it is important to work on ‘difficult’ problems, i.e. with small relative quark masses.)
For the non-hermitian matrix it is not possible to give an accurate bound on , but we know at least that
[TABLE]
and MR (= GMRES(1)) already achieves [5, 16]. The remaining parts of Fig. 2 give the convergence history for GMRES() for several values of for the same configuration as before as well as the corresponding results for BiCG, QMR and BiCGStab. In BiCG and QMR we made use of the savings due to -symmetry. One immediately notices the more erratic behavior of BiCG and BiCGStab. We also see that increasing in GMRES() gives significant improvement, but there seems little use taking larger than 8. Finally, QMR, BiCG, BiCGStab perform best and quite comparably which we can interprete as an indication that they are all close to optimal for . This observation is backed by results from [22] proving that even the full GMRES method did not give substantial improvement over BiCG, QMR or BiCGStab.
Acknowledgement
I am grateful for the continuing excellent cooperation with K. Schilling’s group at Wuppertal and Jülich, particularly to Th. Lippert. The numerical results were obtained together with P. Fiebach from the Department of Mathematics in Wuppertal.
The reference list from the paper itself. Each links out to its DOI / PubMed record.
- 1[1] V. Faber and T. Manteuffel, SIAM J. Numer. Anal. 21 (1984) 352.
- 2[2] C. Paige and A. Saunders, SIAM J. Numer. Anal. 16 (1975) 617.
- 3[3] R. Freund, Numer. Math. 57 (1990), 285.
- 4[4] R. Freund, G. Golub and N. Nachtigal, Acta Numerica (1991) 57.
- 5[5] Y. Saad, Iterative Methods for Sparse Linear Systems, PWS, Boston, 1996.
- 6[6] R. Fletcher, pp. 73-89 in Lecture Notes in Mathematics 506, (G. Watson, ed.), Springer, Berlin, 1975.
- 7[7] R. Freund, N. Nachtigal, Numer. Math. 60 (1991) 315.
- 8[8] R. Freund, pp. 33-47 in Proceedings of the Cornelius Lanczos International Centenary Conference (J. Brown et al., eds), SIAM, Philadelphia, 1994.
