A Parallel SSOR Preconditioner for Lattice QCD

S. Fischer; A. Frommer; U. Glaessner; S. Guesken; H. Hoeber; Th. Lippert; G. Ritzenhoefer; K. Schilling; G. Siegert; A. Spitz

arXiv:hep-lat/9608066·hep-lat·October 28, 2009

TL;DR

This paper introduces a parallel SSOR preconditioner for lattice QCD with Wilson fermions that roughly halves iteration counts and cuts CPU time by 30% to 70% compared to conventional odd-even preconditioning.

Contribution

The paper presents a novel parallel SSOR preconditioning scheme tailored for lattice QCD, improving convergence and computational speed over existing odd-even preconditioning methods.

Findings

01 Reduces iteration count by a factor of 2

02 Achieves 30-70% CPU time savings

03 Effective in Hybrid Monte Carlo and quark propagator calculations

Abstract

A parallelizable SSOR preconditioning scheme for Krylov subspace iterative solvers in lattice QCD applications involving Wilson fermions is presented. In actual Hybrid Monte Carlo and quark propagator calculations it helps to reduce the number of iterations by a factor of 2 compared to conventional odd-even preconditioning. This corresponds to a gain in CPU time of 30% to 70% over odd-even preconditioning.

Equations

$M x = \phi .$

$V_{1}^{-1} M V_{2}^{-1} \tilde{x} = \tilde{\phi}, \quad \tilde{\phi} = V_{1}^{-1} \phi, \quad \tilde{x} = V_{2} x .$

$M = I - L - U .$

$V_{1} = I - L, \quad V_{2} = I - U .$

$v = V_{2}^{-1} r, \quad u = V_{1}^{-1} (r - v), \quad w = v + u .$

Full text

A Parallel SSOR Preconditioner for Lattice QCD

(Talk presented by G. Ritzenhöfer)

S. Fischer (b), A. Frommer (b), U. Glässner (b), S. Güsken (b), H. Hoeber (a), Th. Lippert (a), G. Ritzenhöfer (a), K. Schilling (a, b), G. Siegert (a), A. Spitz (b)

(a) HLRZ c/o Forschungszentrum Jülich, D-52425 Jülich, and DESY, D-22603 Hamburg, Germany

(b) Physics Department, University of Wuppertal, D-42097 Wuppertal, Germany

Abstract

A parallelizable SSOR preconditioning scheme for Krylov subspace iterative solvers in lattice QCD applications involving Wilson fermions is presented. In actual Hybrid Monte Carlo simulations and quark propagator calculations it helps to reduce the number of iterations by a factor of 2 compared to conventional odd-even preconditioning. This corresponds to a gain in CPU time of 30% to 70% over odd-even preconditioning.

1 INTRODUCTION

Efficient numerical algorithms for solving huge sparse systems of linear equations are needed to reduce the enormous computing power required in lattice QCD computations. In the simulation of full QCD, as well as in the calculation of Green's functions or propagators to determine hadron properties such as the spectrum, weak decay constants or weak matrix elements, the computational bottleneck is the solution of the discretized Dirac equation:

$$M x = \phi . \tag{1}$$

Since iterative solvers have almost reached their limit as far as reducing the number of iterations is concerned, preconditioning techniques become important to further accelerate the inversion.

In this contribution, we wish to advocate the use of general parallel SSOR preconditioning techniques in lattice QCD. Our approach may be regarded as a generalization of the well-known odd-even (two-variety) ordering to a more flexible (many-variety) layout or, alternatively, as a localization of the globally lexicographic ordering.

2 PRECONDITIONING

To precondition eq. 1, we take two non-singular matrices $V_{1}$ and $V_{2}$ which act as a left and a right preconditioner, i.e. we consider the new system

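$$V_{1}^{-1} M V_{2}^{-1} \tilde{x} = \tilde{\phi}, \qquad \tilde{\phi} = V_{1}^{-1} \phi, \qquad \tilde{x} = V_{2} x . \tag{2}$$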

We could now apply efficient solvers like BiCGstab replacing each occurrence of $M$ and $\phi$ by $V_{1}^{-1}MV_{2}^{-1}$ and $\tilde{\phi}$ , respectively.

The purpose of preconditioning is to reduce the number of iterations and the computing time necessary to achieve a given accuracy. This means that (a) $V=V_{1}V_{2}$ has to be a sufficiently good approximation to $M$ (so that $V^{-1}$ approximates the inverse of $M$) and (b) solving systems with $V_{1}$ and $V_{2}$ should be sufficiently cheap.

Consider the decomposition of $M$ into its diagonal part (normalized to the identity), its strictly lower triangular part $L$ and its strictly upper triangular part $U$:

$$M = I - L - U . \tag{3}$$

The SSOR preconditioner is given by

$$V_{1} = I - L, \qquad V_{2} = I - U . \tag{4}$$

For the SSOR preconditioner we have $V_{1}+V_{2}-M=I$ . This relation can be exploited through the 'Eisenstat trick' [1]: with $V_{1}^{-1}MV_{2}^{-1}=V_{2}^{-1}+V_{1}^{-1}(I-V_{2}^{-1})$ , the matrix-vector product $w=V_{1}^{-1}MV_{2}^{-1}r$ can be computed economically in the form

$$v = V_{2}^{-1} r, \qquad u = V_{1}^{-1}(r - v), \qquad w = v + u . \tag{5}$$
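The identity behind the Eisenstat trick follows in a few lines from $V_{1}+V_{2}-M=I$. The derivation below uses only the symbols already defined in the text and is included as a worked step, not as part of the original presentation:

```latex
\begin{aligned}
M &= V_{1} + V_{2} - I, \\
V_{1}^{-1} M V_{2}^{-1}
  &= V_{1}^{-1}\,(V_{1} + V_{2} - I)\,V_{2}^{-1} \\
  &= V_{2}^{-1} + V_{1}^{-1} - V_{1}^{-1} V_{2}^{-1} \\
  &= V_{2}^{-1} + V_{1}^{-1}\,(I - V_{2}^{-1}), \\
w = V_{1}^{-1} M V_{2}^{-1} r
  &= \underbrace{V_{2}^{-1} r}_{v} + \underbrace{V_{1}^{-1}(r - v)}_{u} .
\end{aligned}
```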

Note that multiplications with $M$ are completely avoided in this formulation; the only matrix operations are solves with $I-L$ and $I-U$ . Since these matrices are triangular, the solutions can be computed directly via forward or backward substitution.
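As a rough illustration of the Eisenstat trick, the following sketch applies $V_{1}^{-1}MV_{2}^{-1}$ to a vector using only one backward and one forward substitution, and compares the result with the expensive route that multiplies by $M$ explicitly. The small dense random matrices and the helper names (`forward_solve`, `backward_solve`, `eisenstat_apply`, `reference_apply`) are assumptions made for this example; a real Wilson fermion matrix is sparse and complex.

```python
import random

def forward_solve(L, b):
    """Solve (I - L) y = b for strictly lower triangular L (list of rows)."""
    y = [0.0] * len(b)
    for i in range(len(b)):
        y[i] = b[i] + sum(L[i][j] * y[j] for j in range(i))
    return y

def backward_solve(U, b):
    """Solve (I - U) y = b for strictly upper triangular U (list of rows)."""
    n = len(b)
    y = [0.0] * n
    for i in reversed(range(n)):
        y[i] = b[i] + sum(U[i][j] * y[j] for j in range(i + 1, n))
    return y

def eisenstat_apply(L, U, r):
    """w = (I-L)^{-1} M (I-U)^{-1} r without any multiplication by M."""
    v = backward_solve(U, r)                                  # v = V2^{-1} r
    u = forward_solve(L, [ri - vi for ri, vi in zip(r, v)])   # u = V1^{-1} (r - v)
    return [vi + ui for vi, ui in zip(v, u)]                  # w = v + u

def reference_apply(L, U, r):
    """Same product, computed with an explicit multiplication by M = I - L - U."""
    n = len(r)
    y = backward_solve(U, r)
    My = [y[i] - sum((L[i][j] + U[i][j]) * y[j] for j in range(n)) for i in range(n)]
    return forward_solve(L, My)

# Toy data (assumption): small random strictly triangular parts.
random.seed(0)
n = 6
L = [[0.1 * random.uniform(-1, 1) if j < i else 0.0 for j in range(n)] for i in range(n)]
U = [[0.1 * random.uniform(-1, 1) if j > i else 0.0 for j in range(n)] for i in range(n)]
r = [random.uniform(-1, 1) for _ in range(n)]

w_fast = eisenstat_apply(L, U, r)
w_ref = reference_apply(L, U, r)
max_diff = max(abs(a - b) for a, b in zip(w_fast, w_ref))
print("largest deviation between the two routes:", max_diff)
```

The two routes agree up to rounding error, while `eisenstat_apply` needs two triangular solves and no product with $M$.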

The preconditioned residuals $\tilde{r}_{i}$ are related to the unpreconditioned residuals $r_{i}=\phi-Mx_{i}$ by $\tilde{r}_{i}=V_{1}^{-1}r_{i}$ [5]. Once the stopping criterion for the preconditioned system is met, one can compute $r_{i}$ and restart if the solution is not yet accurate enough.

3 ORDERINGS

In eq. 1 with the Wilson fermion matrix $M$ we have the freedom to choose any ordering scheme for the lattice points $x$ . Different orderings yield different matrices $M$ , which are permutationally similar to each other.

Consider an arbitrary ordering of the lattice points. For a given grid point $x$ , the corresponding row in the matrix $L$ or $U$ contains exactly the coupling coefficients of those nearest neighbors of $x$ which have been numbered before or after $x$ , respectively. Therefore, a generic formulation of the forward or backward solution for this ordering is given by the rules

• touch every site

• forward solve: respect sites numbered before

• backward solve: respect sites numbered after
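The rules above can be sketched for an arbitrary site ordering as follows. This is a toy model under stated assumptions: a periodic two-dimensional lattice, a single real hopping coefficient `kappa` in place of the spin and colour blocks of the Wilson matrix, and hypothetical helper names (`neighbors`, `sweep_solve`, `apply_triangular`).

```python
from itertools import product

def neighbors(site, dims):
    """Nearest neighbors of `site` on a periodic lattice with extents `dims`."""
    out = []
    for mu, n in enumerate(dims):
        for step in (+1, -1):
            z = list(site)
            z[mu] = (z[mu] + step) % n
            out.append(tuple(z))
    return out

def sweep_solve(b, order, dims, kappa, forward=True):
    """Solve (I - L) y = b (forward) or (I - U) y = b (backward) for `order`."""
    rank = {x: k for k, x in enumerate(order)}
    y = {}
    for x in (order if forward else reversed(order)):   # touch every site
        acc = b[x]
        for z in neighbors(x, dims):
            # forward: respect sites numbered before x; backward: sites numbered after x
            if (rank[z] < rank[x]) == forward:
                acc += kappa * y[z]
        y[x] = acc
    return y

def apply_triangular(y, order, dims, kappa, forward=True):
    """Multiply y by (I - L) (forward=True) or by (I - U) (forward=False)."""
    rank = {x: k for k, x in enumerate(order)}
    out = {}
    for x in order:
        coupled = sum(y[z] for z in neighbors(x, dims) if (rank[z] < rank[x]) == forward)
        out[x] = y[x] - kappa * coupled
    return out

dims = (4, 4)
kappa = 0.12  # toy value, not taken from the paper
sites = list(product(*(range(n) for n in dims)))
orderings = {
    "lexicographic": sites,
    "odd-even": sorted(sites, key=lambda x: (sum(x) + 1) % 2),  # odd sites first
}
b = {x: 1.0 + 0.1 * k for k, x in enumerate(sites)}

# Check that each sweep really inverts the corresponding triangular matrix.
max_err = 0.0
for name, order in orderings.items():
    for forward in (True, False):
        y = sweep_solve(b, order, dims, kappa, forward)
        back = apply_triangular(y, order, dims, kappa, forward)
        max_err = max(max_err, max(abs(back[x] - b[x]) for x in sites))
print("largest residual over all orderings and sweeps:", max_err)
```

The same `sweep_solve` routine serves every ordering; only the list `order` changes, which is the sense in which the formulation is generic.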

In this context, the odd-even ordering, so far generally considered as the only successful preconditioner in a parallel computing environment, is seen to be a specific example of SSOR preconditioning (in odd-even ordering all odd lattice points are numbered before the even ones). In traditional QCD computations, the odd-even preconditioning is not implemented by using the above formulation of the forward (and backward) solvers, as for this particular ordering the inverses of $I-L$ and $I-U$ can be determined directly.
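To see why the inverses are available directly in the odd-even case, one can write $M$ in block form with the odd sites numbered first. The block names $D_{oe}$ and $D_{eo}$ are notation introduced here for the hopping terms between the two sublattices, and the sketch assumes that every nearest neighbor of an odd site is even and vice versa (as on a periodic lattice with even extents):

```latex
\begin{aligned}
M &= \begin{pmatrix} I & -D_{oe} \\ -D_{eo} & I \end{pmatrix}, \qquad
L = \begin{pmatrix} 0 & 0 \\ D_{eo} & 0 \end{pmatrix}, \qquad
U = \begin{pmatrix} 0 & D_{oe} \\ 0 & 0 \end{pmatrix}, \\
L^{2} &= U^{2} = 0
\;\;\Longrightarrow\;\;
(I-L)^{-1} = I + L, \qquad (I-U)^{-1} = I + U, \\
(I-L)^{-1} M (I-U)^{-1} &= (I+L)(I-L-U)(I+U) = I - LU
= \begin{pmatrix} I & 0 \\ 0 & I - D_{eo} D_{oe} \end{pmatrix}.
\end{aligned}
```

The preconditioned system thus decouples, and only the reduced system on the even sublattice has to be solved iteratively.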

Defining $M$ with the natural (global lexicographic, $gl$) ordering (fig. 1) leads to a further improvement over odd-even preconditioning as far as the number of iterations is concerned [2]. However, its parallel implementation turned out to be impractical [3].

4 PARALLELISATION

Unlike the lexicographical ordering, the ordering we propose now is adapted to the parallel computer used to solve eq. 1. We assume that the processors of the parallel computer are connected as a $p_{1}\times p_{2}\times p_{3}\times p_{4}$ 4-dimensional grid. The space-time lattice can be matched to the processor grid in a natural manner, producing a local lattice of size $n^{loc}_{1}\times n^{loc}_{2}\times n^{loc}_{3}\times n^{loc}_{4}$ with $n^{loc}_{i}=n_{i}/p_{i}$ on each processor.

Let us partition the whole lattice into $n^{loc}=n^{loc}_{1}n^{loc}_{2}n^{loc}_{3}n^{loc}_{4}$ groups. Each group corresponds to a fixed position of the local grid and contains all grid points appearing at this position within their respective local grid.

We now consider a natural ordering on each local grid, which allows a coherent update of corresponding sites on each processor in parallel. Inter-node communication has to be done when the local grid border is touched. Fig. 2 shows the speedups for various local lattice sizes ( $16=2^{4},64=2^{2}\cdot 4^{2},128=2\cdot 4^{3},256=2\cdot 8\cdot 4^{2}$ ). It should be noted that the number of floating point operations per iteration is the same in each case.
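The grouping described in this section can be sketched as follows. The function names `local_lexicographic_groups` and `same_group_neighbors` are hypothetical, the lattice and processor grid sizes are toy values, and which index runs fastest within the local lexicographic order is a convention chosen for this example.

```python
from itertools import product

def local_lexicographic_groups(n, p):
    """Group lattice sites by their position inside the local grid.

    n: global lattice extents, p: processor grid extents.
    Returns the groups in lexicographic order of the local position;
    each group holds one site per processor.
    """
    if any(ni % pi for ni, pi in zip(n, p)):
        raise ValueError("each lattice extent must be divisible by the processor extent")
    n_loc = [ni // pi for ni, pi in zip(n, p)]
    groups = {}
    for x in product(*(range(ni) for ni in n)):
        local = tuple(xi % li for xi, li in zip(x, n_loc))  # position in the local grid
        groups.setdefault(local, []).append(x)
    # Tuples compare lexicographically, so sorting the keys gives the local ordering.
    return [groups[key] for key in sorted(groups)]

def same_group_neighbors(groups, n):
    """Count nearest-neighbor pairs (periodic lattice) that share a group."""
    count = 0
    for members in groups:
        member_set = set(members)
        for x in members:
            for mu in range(len(n)):
                z = x[:mu] + ((x[mu] + 1) % n[mu],) + x[mu + 1:]
                count += z in member_set
    return count

n = (4, 4, 4, 4)   # toy global lattice
p = (2, 2, 2, 2)   # toy processor grid, local lattice 2 x 2 x 2 x 2
groups = local_lexicographic_groups(n, p)
print(len(groups), "groups of", len(groups[0]), "sites each")
print("neighbor pairs inside a group:", same_group_neighbors(groups, n))
```

Because no two sites of a group are nearest neighbors when every local extent is at least 2, all sites of a group can be updated simultaneously, one per processor, during the forward or backward sweep.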

5 APPLICATION

Our numerical tests of the locally lexicographic SSOR preconditioner were performed on APE100/Quadrics machines (Q4, QH4), a SIMD parallel architecture optimized for fast floating point arithmetic on block data structures like $3\times 3$ SU(3) matrices. We applied the locally lexicographic ( $ll$ ) preconditioner to quark propagator calculations and chiral Hybrid Monte Carlo simulations with Wilson fermions on large lattices. The CPU time gains for the computation of propagators are shown in fig. 3. In our current HMC implementation on a QH4 with a $24^{3}\times 40$ lattice close to the chiral regime we found a speedup in convergence rate by a factor of 2, corresponding to a gain in CPU time of 70% in our implementation.

6 CONCLUSION

We have presented a new local grid point ordering scheme that allows efficient preconditioning of Krylov subspace solvers on parallel computers. For local lattice sizes $>128$ , the number of iterations as well as the required floating point operations are reduced by a factor of 2. This corresponds to an implementation- and machine-dependent gain in CPU time of 30% to 70%.

Bibliography

[1] Eisenstat, S.: Efficient Implementation of a Class of Preconditioned CG Methods, SIAM J. Sci. Stat. Comp. 2, 1 (1981).
[2] Oyanagi, Y.: An Incomplete LDU Decomposition of Lattice Fermions and its Application to Conjugate Residual Methods, Comp. Phys. Comm. 42, 333 (1986).
[3] Hockney, G.: Nucl. Phys. B (Proc. Suppl.) 17, 301 (1990).
[4] Frommer, A., Hannemann, V., Nöckel, B., Lippert, Th. and Schilling, K.: Accelerating Wilson Fermion Matrix Inversions by Means of BiCGstab, Int. J. of Mod. Phys. C 5 (6), 1073–1088 (1994).
[5] Fischer, S., Frommer, A., Glässner, U., Lippert, Th., Ritzenhöfer, G. and Schilling, K.: A Parallel SSOR Preconditioner for Lattice QCD, Int. J. of Mod. Phys. C, in print.