A mixed precision LOBPCG algorithm
Daniel Kressner, Yuxin Ma, Meiyue Shao

TL;DR
This paper introduces a mixed precision variant of the LOBPCG algorithm that reduces computation time significantly while maintaining convergence quality, suitable for large Hermitian matrices on CPUs and GPUs.
Contribution
The paper presents a novel mixed precision LOBPCG algorithm with a mixed precision preconditioner and orthogonalization strategy, along with theoretical analysis of its convergence impact.
Findings
Reduces computation time by a factor of 1.4--2.0
Maintains marginal impact on convergence
Effective on both CPUs and GPUs
Abstract
The locally optimal block preconditioned conjugate gradient (LOBPCG) algorithm is a popular approach for computing a few smallest eigenvalues and the corresponding eigenvectors of a large Hermitian positive definite matrix A. In this work, we propose a mixed precision variant of LOBPCG that uses a (sparse) Cholesky factorization of A computed in reduced precision as the preconditioner. To further enhance performance, a mixed precision orthogonalization strategy is proposed. To analyze the impact of reducing precision in the preconditioner on performance, we carry out a rounding error and convergence analysis of PINVIT, a simplified variant of LOBPCG. Our theoretical results predict and our numerical experiments confirm that the impact on convergence remains marginal. In practice, our mixed precision LOBPCG algorithm typically reduces the computation time by a factor of 1.4--2.0 on both…
Table 1: Software libraries used under different settings.
| | Cholesky | TRSM | mat–vec | others |
|---|---|---|---|---|
| CPU/sparse | CHOLMOD | CHOLMOD | MKL | LAPACK |
| GPU/sparse | CHOLMOD | cuSPARSE | cuSPARSE | MAGMA |
| CPU/dense | LAPACK | LAPACK | LAPACK | LAPACK |
| GPU/dense | MAGMA | MAGMA | MAGMA | MAGMA |

Table 2: Run time and savings of the mixed precision orthogonalization (Algorithm 2).
| | Run time of Algorithm 2 | Run time of DGEQRF | Savings |
|---|---|---|---|

Table 3: Sparse test matrices.
| Name | Size | NNZ | Sparsity | NNZ of Cholesky factor |
|---|---|---|---|---|
| obstclae | 40,000 | 197,608 | | 1,561,880 |
| shallow_water2 | 81,920 | 327,680 | | 3,483,014 |
| Dubcova2 | 65,025 | 1,030,225 | | 3,804,558 |
| Dubcova3 | 146,689 | 3,636,643 | | 7,409,077 |
| finan512 | 74,752 | 596,992 | | 3,376,835 |
| 2D-Laplace | 25,000 | 114,990 | | 466,491 |
2023
These authors contributed equally to this work.
[1] Institute of Mathematics, EPFL, Lausanne, CH-1015, Switzerland
[2] School of Mathematical Sciences, Fudan University, Shanghai, 200433, China
[3] School of Data Science, Fudan University, Shanghai, 200433, China
[4] MOE Key Laboratory for Computational Physical Sciences, Fudan University, Shanghai, 200433, China
Abstract
The locally optimal block preconditioned conjugate gradient (LOBPCG) algorithm is a popular approach for computing a few smallest eigenvalues and the corresponding eigenvectors of a large Hermitian positive definite matrix $A$. In this work, we propose a mixed precision variant of LOBPCG that uses a (sparse) Cholesky factorization of $A$ computed in reduced precision as the preconditioner. To further enhance performance, a mixed precision orthogonalization strategy is proposed. To analyze the impact of reducing precision in the preconditioner on performance, we carry out a rounding error and convergence analysis of PINVIT, a simplified variant of LOBPCG. Our theoretical results predict and our numerical experiments confirm that the impact on convergence remains marginal. In practice, our mixed precision LOBPCG algorithm typically reduces the computation time by a factor of 1.4–2.0 on both CPUs and GPUs.
Keywords: Symmetric eigenvalue problem, LOBPCG algorithm, mixed precision algorithm
MSC Classification: 65F15, 65F50
1 Introduction
Given a large Hermitian positive definite matrix $A \in \mathbb{C}^{n \times n}$, this work considers the computation of the $k$ smallest eigenvalues and the corresponding eigenvectors $x_1, \dotsc, x_k$, satisfying
$$AX = X\Lambda,$$
where $X = [x_1, \dotsc, x_k]$ and $\Lambda$ is diagonal with diagonal entries $\lambda_1 \le \dotsb \le \lambda_k$. This problem is encountered in many applications, such as PDEs and optimization problems, electronic structure calculations, and machine learning; see, for example, BGHRCV2010; K2017; S2011.
When a good preconditioner $T$ for $A$ is available, the preconditioned inverse iteration (PINVIT) from N2001-1 is a good candidate for solving such eigenvalue problems. For $k = 1$, PINVIT takes the form
$$x_{i+1} = x_i - T\bigl(Ax_i - \rho(x_i)\,x_i\bigr), \qquad i = 0, 1, \dotsc,$$
for some starting vector $x_0$. Here, $\rho(x) = x^* A x / (x^* x)$ denotes the Rayleigh quotient, which is also used to approximate the eigenvalue at each iteration. Note that PINVIT with the “ideal” preconditioner $T = A^{-1}$ becomes equivalent to inverse iteration. When computing several ($k > 1$) smallest eigenpairs, one chooses a starting matrix $X_0 \in \mathbb{C}^{n \times k}$ with orthonormal columns and one step of the block version of PINVIT N2002 takes the form
$$\tilde X_{i+1} = X_i - T\bigl(AX_i - X_i \Theta_i\bigr),$$
where $\Theta_i = X_i^* A X_i$. The next iterate $X_{i+1}$ is obtained from orthonormalizing the columns of $\tilde X_{i+1}$ by, e.g., a QR factorization. Under mild conditions, linear convergence of PINVIT is proven in AKNOZ2017, with a convergence rate depending on the quality of the preconditioner $T$. The locally optimal block preconditioned conjugate gradient (LOBPCG) method K2001 aims at accelerating the convergence of PINVIT by choosing the next iterate optimally from a $3k$-dimensional subspace that contains the current as well as the previous iterate and the preconditioned residual; see Section 2 for more details. LOBPCG converges at least as fast as PINVIT and often significantly faster.
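The single-vector PINVIT iteration described above can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: the function names `rayleigh_quotient` and `pinvit`, the normalization of the iterate after each step, the absolute residual stopping test, and the 1D Laplacian test matrix are choices made here, not taken from the paper.

```python
import numpy as np

def rayleigh_quotient(A, x):
    """Rayleigh quotient rho(x) = x^* A x / (x^* x)."""
    return np.real(np.vdot(x, A @ x)) / np.real(np.vdot(x, x))

def pinvit(A, apply_T, x0, tol=1e-10, maxit=500):
    """Single-vector PINVIT: x <- x - T (A x - rho(x) x), followed by rescaling."""
    x = x0 / np.linalg.norm(x0)
    for _ in range(maxit):
        rho = rayleigh_quotient(A, x)
        r = A @ x - rho * x              # eigenvalue residual
        if np.linalg.norm(r) <= tol:     # assumed stopping test (absolute residual)
            break
        x = x - apply_T(r)               # preconditioned correction
        x = x / np.linalg.norm(x)        # rescaling does not change rho(x)
    return rayleigh_quotient(A, x), x

# Example: 1D Laplacian with the "ideal" preconditioner T = A^{-1},
# for which PINVIT reduces to inverse iteration.
n = 50
A = 2.0 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)
rng = np.random.default_rng(0)
lam, x = pinvit(A, lambda r: np.linalg.solve(A, r), rng.standard_normal(n))
lam_exact = 2.0 - 2.0 * np.cos(np.pi / (n + 1))  # smallest eigenvalue of this matrix
print(abs(lam - lam_exact))
```

Passing the preconditioner as a callable keeps the iteration independent of how $T$ is applied, which is convenient when the application is later moved to a lower precision.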
Executing an algorithm in reduced (single) precision on, e.g., a GPU, can be significantly faster than executing it in default working (double) precision. On the other hand, critical applications may require eigenvalues and eigenvectors computed to an accuracy warranted by working precision. In such a scenario the use of mixed precision algorithms can be beneficial; see AABCCD2021; HM2022 for an overview. For example, Carson and Higham CH2018 proposed a general framework for large-scale mixed precision linear system solvers based on iterative refinement. They highlight that a mixed precision algorithm can be twice as fast as a traditional linear system solver by computing the most expensive part, the LU factorization, in reduced precision. For eigenvalue problems, mixed precision algorithms have recently been proposed for computing all eigenvalues and eigenvectors of a dense matrix. These include Newton-like iterative refinement methods for symmetric OA2018; OA2019; OA2020 and nonsymmetric BKS2022 eigenvalue problems, as well as a mixed precision one-sided Jacobi SVD algorithm GMS2022. If only a few eigenvalues and eigenvectors are of interest, one could combine mixed precision with classical iterative refinement D1982 for eigenvalue problems, which solves linear systems with a shifted matrix in order to correct an approximation of each eigenvalue of interest. The need for solving several differently shifted linear systems makes such an approach rather expensive.
In this work, we propose mixed precision PINVIT and LOBPCG algorithms that use a (sparse) Cholesky factorization of $A$ computed in reduced precision as preconditioner. This significantly reduces the cost of accurately computing eigenvalues and eigenvectors compared to inverse iteration, which requires carrying out the Cholesky factorization in working precision. On the theoretical side, we carry out a rounding error analysis of PINVIT, which predicts that reducing precision in the preconditioner usually only has a marginal impact on convergence. On the experimental side, we demonstrate for sparse matrices that our mixed precision LOBPCG algorithm results in up to
speedup on a CPU and
speedup on a GPU. For dense matrices, the speedups are
on a CPU and
on a GPU.
The rest of this paper is organized as follows. In Section 2, we explain the basic ideas of the LOBPCG algorithm. Then in Section 3, we propose our mixed precision algorithms and describe the details of the implementation. The analysis is presented in Section 4, and numerical experiments are presented in Section 5 to show the efficiency of our mixed precision LOBPCG algorithm.
2 LOBPCG algorithm
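In this section, we explain the basic idea of the LOBPCG algorithm from K2001. For $k = 1$, LOBPCG can be derived from the preconditioned conjugate gradient (PCG) method. PCG applied to the (singular) linear system $(A - \lambda_1 I)x = 0$ with preconditioner $T$ and initial guess $x_0$ is a three-term recurrence of the form
$$x_{i+1} = x_i + \alpha_i T(A - \lambda_1 I)x_i + \beta_i (x_i - x_{i-1}),$$
where $\alpha_i$, $\beta_i$ are chosen to minimize $x_{i+1}^* (A - \lambda_1 I)\, x_{i+1}$. As the smallest eigenvalue $\lambda_1$ is usually unknown, it needs to be replaced by an approximation, the Rayleigh quotient $\rho(x_i)$, leading to the basic form of LOBPCG:
$$x_{i+1} = \alpha_i x_i + \beta_i T\bigl(A - \rho(x_i) I\bigr)x_i + \delta_i x_{i-1},$$
where $\alpha_i$, $\beta_i$, and $\delta_i$ are chosen to minimize $\rho(x_{i+1})$. Note that, unlike PCG, LOBPCG is not a Krylov subspace method in the usual sense because $\rho(x_i)$ is different in each iteration.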
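For $k > 1$, LOBPCG takes an initial guess $X_0 \in \mathbb{C}^{n \times k}$ with $X_0^* X_0 = I_k$, and produces iterates of the form
$$X_{i+1} = S_i C_i,$$
where $S_i = [X_i, W_i, X_{i-1}]$ with $W_i = T\bigl(AX_i - X_i (X_i^* A X_i)\bigr)$. The matrix $C_i \in \mathbb{C}^{3k \times k}$ is chosen to minimize
$$\operatorname{trace}\bigl(X_{i+1}^* A X_{i+1}\bigr) \quad \text{subject to} \quad X_{i+1}^* X_{i+1} = I_k, \tag{1}$$
where $\operatorname{trace}(\cdot)$ denotes the trace of a matrix. By the Rayleigh–Ritz method, a solution of (1) is obtained from the eigenvectors belonging to the $k$ smallest eigenvalues of the generalized eigenvalue problem $(S_i^* A S_i)\, c = \theta\, (S_i^* S_i)\, c$; see (GV2013, Section 8.7.2) for numerical algorithms.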
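One block LOBPCG step, built from the current iterate, the preconditioned residual, and the previous iterate, can be sketched as follows. This sketch makes a simplifying assumption that differs from the generalized eigenvalue formulation: the search basis is first orthonormalized by a QR factorization, so that the Rayleigh–Ritz step becomes a standard Hermitian eigenvalue problem. The function name `lobpcg_step` and the diagonal test matrix are illustrative, and this is not the robust implementation referred to in the text.

```python
import numpy as np

def lobpcg_step(A, apply_T, X, X_prev=None):
    """One block step: Rayleigh-Ritz on span[X, W, X_prev] (X_prev omitted at the start)."""
    k = X.shape[1]
    Theta = X.conj().T @ (A @ X)
    W = apply_T(A @ X - X @ Theta)              # preconditioned block residual
    blocks = [X, W] if X_prev is None else [X, W, X_prev]
    Q, _ = np.linalg.qr(np.hstack(blocks))      # orthonormal basis of the search space
    H = Q.conj().T @ (A @ Q)                    # projected matrix
    theta, C = np.linalg.eigh((H + H.conj().T) / 2)
    return Q @ C[:, :k], theta[:k]              # Ritz vectors/values for the k smallest

# Example: diagonal test matrix with eigenvalues 1, 2, ..., n and T = A^{-1}.
n, k = 200, 3
A = np.diag(np.arange(1.0, n + 1))
apply_T = lambda R: np.linalg.solve(A, R)
rng = np.random.default_rng(1)
X, _ = np.linalg.qr(rng.standard_normal((n, k)))
X_prev = None
for _ in range(100):
    X_new, theta = lobpcg_step(A, apply_T, X, X_prev)
    X_prev, X = X, X_new
print(theta)
```

Orthonormalizing the whole basis in every step is simple but costly; it is used here only to keep the sketch short and numerically well behaved.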
Let us stress that the actual implementation of LOBPCG is quite different DSYG2018 due to the numerical instability caused by the ill-conditioning of the basis of the search subspace. In practice, part of this basis can be orthogonalized by an improved Hetmaniuk–Lehoucq trick (DSYG2018, Section 4.2), and then the remaining block also needs to be orthogonalized carefully.
3 Mixed precision algorithms
In this section, we derive a mixed precision LOBPCG algorithm. For this purpose, we consider two precisions: a working precision and a lower/reduced precision, e.g., IEEE double and single precisions. The input and output data of our algorithms are always stored in working precision. Two conversion functions are used to convert working precision data into lower precision and vice versa.
3.1 Lower precision preconditioning
The application of the preconditioner usually consumes a considerable fraction of the computational expense of PINVIT and LOBPCG. This suggests implementing the application of $T$ in lower precision. In most cases, we expect that this only has a small impact on convergence. While a more detailed analysis will be provided in Section 4, the existing convergence analysis of PINVIT already provides a good intuition.
By (AKNOZ2017, Theorem 2.1), PINVIT with $k = 1$ converges to the smallest eigenvalue and eigenvector when $\lVert I - TA \rVert_A \le \gamma < 1$ and additional mild conditions are satisfied. Asymptotically, the convergence is linear with a rate bounded by a quantity that depends on $\gamma$. If $T$ is perturbed by rounding error in lower precision one effectively applies a preconditioner $\hat T$, which remains close to $T$. In turn, the convergence is now determined by $\hat\gamma$ with $\lVert I - \hat T A \rVert_A \le \hat\gamma$, which remains close to $\gamma$. Unless $\gamma$ is very close to $1$ we thus expect that replacing $T$ by $\hat T$ does not affect convergence significantly. These considerations lead to Algorithm 1, PINVIT with a lower precision preconditioner.
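The idea of PINVIT with a lower precision preconditioner can be illustrated with a small dense, real example in which the Cholesky factorization and the two triangular solves are carried out in `float32` while everything else stays in `float64`. This is a sketch in the spirit of Algorithm 1, not a transcription of it: the function names, the stopping test, and the use of `np.linalg.solve` (a general dense solver standing in for forward and backward substitution) are assumptions made here.

```python
import numpy as np

def low_precision_cholesky_preconditioner(A):
    """Return r -> T_hat r, with T_hat ~ A^{-1} applied through a float32 Cholesky factor."""
    L = np.linalg.cholesky(A.astype(np.float32))        # factorization in single precision
    def apply_T(r):
        y = np.linalg.solve(L, r.astype(np.float32))    # stands in for forward substitution
        z = np.linalg.solve(L.T, y)                     # stands in for backward substitution
        return z.astype(np.float64)                     # convert back to working precision
    return apply_T

def pinvit_low_precision_preconditioner(A, x0, tol=1e-10, maxit=500):
    """PINVIT in float64 with a float32 preconditioner (real symmetric A assumed)."""
    apply_T = low_precision_cholesky_preconditioner(A)
    x = x0 / np.linalg.norm(x0)
    for it in range(maxit):
        Ax = A @ x
        rho = x @ Ax                     # Rayleigh quotient of the unit vector x
        r = Ax - rho * x
        if np.linalg.norm(r) <= tol:
            return rho, x, it
        x = x - apply_T(r)
        x = x / np.linalg.norm(x)
    return x @ (A @ x), x, maxit

n = 50
A = 2.0 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)   # illustrative test matrix
rng = np.random.default_rng(0)
lam, x, its = pinvit_low_precision_preconditioner(A, rng.standard_normal(n))
lam_exact = 2.0 - 2.0 * np.cos(np.pi / (n + 1))
print(its, abs(lam - lam_exact))
```

Because the residual is evaluated in working precision, the low precision solve only perturbs the correction direction, which is why the attainable accuracy in this example is not limited to single precision.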
3.2 A mixed precision orthogonalization procedure
In both PINVIT and LOBPCG, we need to produce an orthogonal basis of the search subspace in each iteration. Moreover, orthogonalization plays an important role in ensuring numerical stability of the LOBPCG algorithm DSYG2018; HL2006. We therefore need to perform the orthogonalization procedure as accurately as possible. However, orthogonalization is often quite expensive in practice, so it is desirable to make use of a lower precision to accelerate this procedure.
There are mainly two existing mixed precision algorithms for computing the QR factorization. The algorithm proposed in YTD2015 uses higher precision to compute the inner product to enhance the numerical stability of Cholesky-QR algorithm. The drawback is that this algorithm can be much slower than the standard Cholesky-QR algorithm if higher precision arithmetic lacks hardware support. To improve the performance, a mixed precision block Gram–Schmidt orthogonalization algorithm was proposed in YTKDB2015 . For both algorithms the orthogonality of the output depends linearly on the condition number of the input.
We propose another mixed precision approach for orthogonalization. We first use Householder-QR to factorize $X = \hat Q \hat R$ in lower precision. Then $\hat R$ is used as a preconditioner: we apply Cholesky-QR to the preconditioned matrix $X \hat R^{-1}$ to refine the orthogonality. Under mild assumptions $X \hat R^{-1}$ is reasonably well-conditioned, so that the Cholesky-QR algorithm is sufficiently accurate. This mixed precision QR factorization algorithm is summarized in Algorithm 2.
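The three steps of this orthogonalization strategy (low precision Householder-QR, preconditioning by the low precision triangular factor, Cholesky-QR in working precision) can be sketched for real matrices as follows. The function name `mixed_precision_qr`, the use of `np.linalg.solve` in place of a dedicated triangular solver, and the badly scaled test matrix are assumptions of this sketch rather than details of Algorithm 2.

```python
import numpy as np

def mixed_precision_qr(X):
    """Return Q, R with X ~ Q R and Q^T Q ~ I in float64 (real X assumed)."""
    # Step 1: Householder QR in single precision; only the triangular factor is kept.
    R_lo = np.linalg.qr(X.astype(np.float32), mode="r").astype(np.float64)
    # Step 2: preconditioned matrix Y = X R_lo^{-1}, formed in working precision.
    Y = np.linalg.solve(R_lo.T, X.T).T
    # Step 3: Cholesky-QR of the (well-conditioned) matrix Y in working precision.
    L = np.linalg.cholesky(Y.T @ Y)          # Y^T Y = L L^T
    Q = np.linalg.solve(L, Y.T).T            # Q = Y L^{-T}
    R = L.T @ R_lo                           # overall triangular factor
    return Q, R

# Example: tall-skinny matrix with badly scaled columns (condition number around 1e4).
rng = np.random.default_rng(0)
X = rng.standard_normal((2000, 10)) * np.logspace(0, 4, 10)
Q, R = mixed_precision_qr(X)
orth_err = np.linalg.norm(Q.T @ Q - np.eye(10))
fact_err = np.linalg.norm(Q @ R - X) / np.linalg.norm(X)
print(orth_err, fact_err)
```

Plain Cholesky-QR applied directly to `X` would lose orthogonality roughly in proportion to the squared condition number of `X`; the low precision factor is used here only to bring that condition number close to one before the Gram matrix is formed.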
3.3 A mixed precision LOBPCG algorithm
In addition to preconditioning and orthogonalization, the application of $A$ and other parts of PINVIT and LOBPCG may also constitute nonnegligible expenses, depending on the specific setting. Carrying out these parts in lower precision bears the risk of limiting the attainable accuracy to lower precision. However, very often it is still possible to further exploit lower precision arithmetic.
As PINVIT and LOBPCG converge linearly in general, we can break the computation in two stages as follows. In the first stage we can first perform all computations in lower precision to produce an approximate solution in lower precision. Then in the second stage we switch back to the working precision while using the approximate solution as an initial guess and applying lower precision preconditioning. In this manner we are able to obtain a satisfactory solution in working precision by making use of lower precision arithmetic as much as possible.
In summary, we compute a good initial guess in lower precision, and then refine the solution using the LOBPCG algorithm in working precision. Lower precision is exploited in both preconditioning and orthogonalization in the LOBPCG algorithm. The resulting mixed precision LOBPCG algorithm is summarized in Algorithm 3.
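The two-stage strategy can be illustrated on single-vector PINVIT, which is simpler than LOBPCG but follows the same pattern: a first stage run entirely in `float32` with a loose tolerance, then a second stage in `float64` that starts from the first-stage result and keeps the `float32` preconditioner. The tolerances, the relative residual stopping test, the function names, and the test matrix below are assumptions chosen for this sketch, not the settings of Algorithm 3.

```python
import numpy as np

def pinvit(A, apply_T, x, tol, maxit=1000):
    """PINVIT in the dtype of A and x; stops when ||r|| <= tol * |rho| (assumed test)."""
    x = x / np.linalg.norm(x)
    for _ in range(maxit):
        Ax = A @ x
        rho = x @ Ax
        r = Ax - rho * x
        if np.linalg.norm(r) <= tol * abs(rho):
            break
        x = x - apply_T(r)
        x = x / np.linalg.norm(x)
    return x @ (A @ x), x

def two_stage_pinvit(A, x0, tol_low=1e-3, tol_work=1e-10):
    A32 = A.astype(np.float32)
    L32 = np.linalg.cholesky(A32)                       # single precision factorization
    solve32 = lambda r: np.linalg.solve(L32.T, np.linalg.solve(L32, r))
    # Stage 1: all computations in single precision, loose tolerance.
    _, x32 = pinvit(A32, solve32, x0.astype(np.float32), tol_low)
    # Stage 2: working precision iteration, single precision preconditioner.
    apply_T = lambda r: solve32(r.astype(np.float32)).astype(np.float64)
    return pinvit(A, apply_T, x32.astype(np.float64), tol_work)

# Example: symmetric positive definite matrix with eigenvalues 1, 2, ..., n.
n = 50
rng = np.random.default_rng(0)
Qmat, _ = np.linalg.qr(rng.standard_normal((n, n)))
A = (Qmat * np.arange(1.0, n + 1)) @ Qmat.T
A = (A + A.T) / 2
lam, x = two_stage_pinvit(A, rng.standard_normal(n))
print(abs(lam - 1.0))
```

The first-stage tolerance has to stay above the accuracy attainable in single precision, otherwise the first stage would stagnate instead of handing over to working precision.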
4 Convergence in finite-precision arithmetic
In our experiments, we observe that rounding error does not significantly affect the convergence of Algorithms 1 and 3 until an accuracy on the level of working precision is reached. To gain theoretical insights on this observation, we study the effect of rounding error on PINVIT for $k = 1$:
$$x_{i+1} = x_i - T\bigl(Ax_i - \rho(x_i)\,x_i\bigr). \tag{2}$$
For simplicity, we consider real matrices, that is,
is positive definite with eigenvalues
. Moreover, we assume that
is far larger than the unit roundoff, even in reduced precision.
In analyzing the effect of rounding error on (2), we assume that the computed matrix–vector product satisfies the backward error
[TABLE]
for some symmetric (depending on ). When carrying out standard matrix–vector multiplication with a dense or sparse matrix then Lemma 6.6 in H2002 states that (3) holds with
[TABLE]
where $\bm{u}$ denotes the unit roundoff in working precision.
Lemma 1.
Let
denote the result of evaluating
in working precision. Assuming that (3) holds, there exist a symmetric matrix
and a diagonal matrix
such that
[TABLE]
where
and
with
[TABLE]
Proof. We first analyze the rounding error when forming
. From (H2002, , Equation (3.5)) and (3), we obtain
[TABLE]
Thus, we have
[TABLE]
Combined with
for , this implies for
that there is
such that
[TABLE]
The vector subtraction and scaling when forming
yield two diagonal matrices
and
such that
[TABLE]
where
and
. Combined with (4), this concludes the proof because
[TABLE]
We model the inexact application of the preconditioner
to in the iteration (2) with the equation
[TABLE]
where
depends on the choice of preconditioner
and the way to compute
. Note that
also depends on
.
Theorem 2.
Consider the setting of Lemma 1 and (5). If
and
[TABLE]
with
[TABLE]
then the computed result
of the PINVIT iteration (2) satisfies
[TABLE]
Proof. By (2), (5), and Lemma 1, there exists a diagonal matrix
(coming from the vector addition) such that
and
[TABLE]
where
. Setting and using that , it follows that
[TABLE]
which takes the form of PINVIT with a perturbed preconditioner. This allows us to apply (AKNOZ2017, , Theorem 2.1), which requires the preconditioner to satisfy
[TABLE]
We now treat the different terms involved in (6) separately. First, we have
[TABLE]
By the assumptions, the spectral radius of
is given by
[TABLE]
This allows us to bound the other terms in (6) as follows:
[TABLE]
Together with (7), this implies that the left-hand side of (6) is bounded by and the statement of the theorem follows from (AKNOZ2017, , Theorem 2.1).
Remark.
We remark that the conclusion of Theorem 2 does not imply that
can eventually drop below machine precision. For the relative error
$\bigl(\rho(x_i)-\lambda_1\bigr)/\bigl(\lambda_2-\rho(x_i)\bigr)$
to be reduced by the factor
[TABLE]
during the $i$th iteration, Theorem 2 requires that
[TABLE]
holds. This reduction takes place until a Rayleigh quotient
for an iterate
is produced for which
[TABLE]
For reasonable choices of , this means that the error is reduced until it reaches the level of working precision.
The quantity $\lVert I - TA \rVert_A$ critically determines the convergence rate of PINVIT. The following lemma provides an estimate if $T$ corresponds to applying $A^{-1}$ in low precision via the Cholesky factorization.
Lemma 3.
Suppose that the application of the preconditioner in one step of PINVIT (2) is implemented by applying $A^{-1}$ in low precision, via performing the Cholesky factorization of $A$ followed by forward and backward substitution. If $4n(3n+1)\kappa(A)\bm{u}_l < 1$, where $\kappa(A) = \lVert A \rVert_2 \lVert A^{-1} \rVert_2$ and $\bm{u}_l$ denotes unit roundoff in low precision, then
[TABLE]
Proof. Using (H2002, Theorem 10.4), there exists a symmetric matrix $E_0$ such that
[TABLE]
which means
and, moreover,
[TABLE]
Then by $\lVert A^{-1/2} E_0 A^{-1/2} \rVert_2 \le 4n(3n+1)\kappa(A)\bm{u}_l < 1$, we have
[TABLE]
Thus, it holds that
[TABLE]
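As a numerical illustration of the setting of Lemma 3, the quantity $\lVert I - \hat T A \rVert_A = \lVert I - A^{1/2} \hat T A^{1/2} \rVert_2$ can be measured for a small matrix when $\hat T$ applies $A^{-1}$ through a single precision Cholesky factorization, and compared with the quantity $4n(3n+1)\kappa(A)\bm{u}_l$ that appears in the proof. The test matrix, the construction of $\hat T$ column by column, and the use of a general dense solver in place of triangular substitution are assumptions of this sketch; it does not reproduce the bound stated in the lemma.

```python
import numpy as np

n = 20
A = 2.0 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)   # illustrative SPD test matrix

# T_hat: columns obtained by applying A^{-1} via a float32 Cholesky factorization.
L = np.linalg.cholesky(A.astype(np.float32))
I32 = np.eye(n, dtype=np.float32)
T_hat = np.linalg.solve(L.T, np.linalg.solve(L, I32)).astype(np.float64)

# ||I - T_hat A||_A equals ||I - A^{1/2} T_hat A^{1/2}||_2.
w, V = np.linalg.eigh(A)
A_half = (V * np.sqrt(w)) @ V.T
gamma_hat = np.linalg.norm(np.eye(n) - A_half @ T_hat @ A_half, 2)

u_l = np.finfo(np.float32).eps / 2        # unit roundoff of IEEE single precision
kappa = w[-1] / w[0]                      # 2-norm condition number of A
eps_bound = 4 * n * (3 * n + 1) * kappa * u_l
print(gamma_hat, eps_bound)
```

In experiments of this kind the measured value is typically far below the worst-case quantity, which is consistent with the observation that the low precision preconditioner has little effect on convergence.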
5 Numerical experiments
In this section, we present numerical results for our mixed precision LOBPCG algorithm. In our tests, the working precision is IEEE double precision and the lower precision is IEEE single precision. Most tests are performed on a Linux server equipped with two twelve-core Intel Xeon E5-2670 v3 2.30 GHz CPUs and two Nvidia GeForce GTX 1080 GPUs. The tests in Section 5.5 also use an Nvidia A30 GPU. There are 128 GB of main memory on the CPUs and 11,178.6 MB of main memory on each GPU. Our program uses only one GPU and one thread on the CPU.
5.1 Experiment settings
In our experiments we compute a few smallest eigenvalues and the corresponding eigenvectors of Hermitian matrices using the LOBPCG algorithm. The following variants of the LOBPCG algorithm are tested:
1. DLOBPCG-dchol: LOBPCG algorithm performed entirely in double precision.
2. DLOBPCG-schol: LOBPCG algorithm performed in double precision, except for single precision preconditioning.
3. MPLOBPCG-schol: mixed precision LOBPCG algorithm (Algorithm 3) with single precision preconditioning and initial guess computed by the single precision LOBPCG algorithm; mixed precision orthogonalization (Algorithm 2) is also used.
When computing $k$ eigenpairs, we run the LOBPCG algorithm with a block size that is about
larger in order to enhance robustness. The algorithm terminates once the $k$ smallest eigenvalues and the corresponding eigenvectors converge. The convergence criterion is
[TABLE]
where
is estimated through
using a Gaussian random matrix
with
. The threshold
in (8) is set to
for all these three algorithms, and is
when computing a good initial guess for MPLOBPCG-schol.
In our tests, we use
as the preconditioner for Algorithm 3, where
is a permutation matrix, and
is the (pivoted) Cholesky factor of
satisfying
computed in single precision. The preconditioning stage in DLOBPCG-schol/MPLOBPCG-schol is to compute
by solving two triangular systems in single precision. In practice, we apply
to the given matrix
instead of applying
to
in each iteration. We can benefit from it if
is sufficiently sparse or the convergence of LOBPCG is not too rapid (i.e., it takes many iterations to converge).
We test the LOBPCG algorithm for both sparse matrices and dense matrices. Table 1 summarizes the software libraries used under different settings. The CHOLMOD package CDHR2008 can compute sparse Cholesky factorizations on both CPU and GPU, while triangular linear solvers are supported only on the CPU. Note that CHOLMOD was developed only for double precision arithmetic; we have derived a single precision version for the purpose of our tests.
5.2 Advantage of mixed precision orthogonalization
Before discussing the LOBPCG algorithm, we first report the run time and savings of the mixed precision orthogonalization algorithm (i.e., Algorithm 2) in Table 2. We can see that for tall-skinny matrices Algorithm 2 can reduce the run time by a factor of
–
compared to DGEQRF in cuSOLVER. Thus, it is worth using this mixed approach for orthogonalization.
5.3 Tests for sparse matrices
We choose six sparse positive definite matrices from the SuiteSparse Matrix Collection (https://sparse.tamu.edu). Table 3 shows the information of these sparse matrices. We compute
eigenpairs using a randomly generated initial guess with
columns for each matrix, and report the relative run time, which is the ratio of the wall clock time of a solver over the wall clock time of DLOBPCG-dchol.
Figures 1 and 2, respectively, show the relative run time on CPU and GPU. For all test cases, preconditioning in single precision reduces the execution time of the LOBPCG algorithm. Using an initial guess computed by the single precision LOBPCG algorithm and adopting mixed precision orthogonalization makes the algorithm more efficient. Compared to DLOBPCG-dchol, MPLOBPCG-schol is about
faster on CPU, and is about
faster on GPU.
We should also mention that the numbers of iterations for the different variants of the LOBPCG algorithm are similar, though they are not shown in the figures. Sometimes MPLOBPCG-schol requires fewer iterations to converge because there is a restart when we use the lower precision result as the initial guess. For instance, the total iterations of DLOBPCG-dchol, DLOBPCG-schol and MPLOBPCG-schol are
,
, and
, respectively, for the 2D-Laplace matrix.
5.4 Tests for dense matrices
We also test the LOBPCG algorithm for a few dense matrices which are popular in machine learning. These dense matrices are kernel matrices generated by certain kernel functions as follows. Let
,
,
,
be uniform random vectors generated by
from LAPACK. We construct a matrix
by applying the Gaussian kernel function
[TABLE]
Similarly, we can apply the polynomial kernel function
[TABLE]
to construct another kernel matrix. Using two sets of random vectors
and
in
, we also construct complex kernel matrices through
[TABLE]
where
is either the Gaussian kernel function or the polynomial kernel function.
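Real kernel matrices of this kind can be generated along the following lines. The kernel formulas and parameters used here (a Gaussian kernel $\exp(-\lVert x - y \rVert^2 / (2\sigma^2))$ with bandwidth `sigma`, and a polynomial kernel $(x^T y + c)^d$) are common textbook choices assumed for illustration; they are not claimed to match the exact definitions or parameter values used in the experiments, and NumPy's random generator stands in for the LAPACK routine.

```python
import numpy as np

def gaussian_kernel_matrix(X, sigma=1.0):
    """K[i, j] = exp(-||x_i - x_j||^2 / (2 sigma^2)); rows of X are the sample vectors."""
    sq = np.sum(X**2, axis=1)
    D2 = np.maximum(sq[:, None] + sq[None, :] - 2.0 * (X @ X.T), 0.0)  # squared distances
    return np.exp(-D2 / (2.0 * sigma**2))

def polynomial_kernel_matrix(X, c=1.0, degree=2):
    """K[i, j] = (x_i^T x_j + c)^degree."""
    return (X @ X.T + c) ** degree

# Example: 300 uniform random vectors in dimension 5 (sizes chosen for illustration).
rng = np.random.default_rng(0)
X = rng.uniform(0.0, 1.0, size=(300, 5))
K_gauss = gaussian_kernel_matrix(X)
K_poly = polynomial_kernel_matrix(X)
print(K_gauss.shape, K_poly.shape)
```

Both matrices are symmetric positive semidefinite in exact arithmetic; whether they are numerically positive definite depends on the kernel parameters and the number of samples, so a Cholesky-based preconditioner may require a suitable choice of these parameters.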
We choose
in our experiments, and compute
smallest eigenvalues and the corresponding eigenvectors. The rank of initial guess is chosen as
accordingly. Figures 3, 4, and 5 show the relative run time of different variants of the LOBPCG algorithm. For real matrices, MPLOBPCG-schol achieves
and
speedup compared to DLOBPCG-dchol on CPU and GPU, respectively. The speedup is higher than that for sparse matrices, because dense matrices are more compute-intensive. The benefit for mixed precision approaches is more significant for complex matrices—the speedup becomes over
and up to
on GPU.
5.5 Tests on different GPUs
So far, our tests have been performed on an Nvidia GeForce GTX 1080 GPU, which is a consumer-grade GPU. There are two different types of GPU: consumer-grade and server-grade. Compared to consumer-grade GPUs, server-grade GPUs usually have better hardware support for double precision arithmetic. Hence the performance difference between single and double precision arithmetic is larger on consumer-grade GPUs.
In the following we report some results collected from runs on an Nvidia A30 GPU, which is a server-grade one. We use the matrix 2D-Laplace in this test. By perturbing off-diagonal entries of this matrix by
, we also obtain a Hermitian positive definite matrix for testing complex arithmetic. From Figure 6, it can be seen that single precision has limited advantage over double precision on this server-grade GPU. Though MPLOBPCG-schol still achieves about
speedup compared to DLOBPCG-dchol, the benefit of adopting single precision arithmetic is much lower than that on the Nvidia GeForce GTX 1080, which is a consumer-grade GPU.
6 Conclusion
In this paper, we have proposed a mixed precision LOBPCG algorithm with a preconditioner based on a (sparse) Cholesky factorization. Both the initial guess and the preconditioner are computed in reduced precision. This largely improves the performance while it only has marginal impact on convergence. In our mixed precision LOBPCG algorithm, orthogonalization is also performed in a mixed precision manner to further improve performance. We analyze the rounding error of the PINVIT algorithm, which can be viewed as a simplified version of the LOBPCG algorithm, to confirm that our mixed precision algorithm is as accurate as the fixed precision one. Numerical experiments illustrate that adopting mixed precision arithmetic can significantly accelerate the execution of the LOBPCG algorithm on both CPUs and GPUs.
Acknowledgments
The authors thank Erin Carson for helpful discussions. Part of this work was performed when the second author was visiting EPF Lausanne in 2022.
Yuxin Ma is partially supported by the State Scholarship Fund of China Scholarship Council (CSC) under Grant No. 202106100093, National Key R&D Program of China under Grant No. 2021YFA1003305 and National Natural Science Foundation of China under Grant No. 71991471. Meiyue Shao is partially supported by the National Natural Science Foundation of China under Grant No. 11971118.
References
- (1) Balcan, D., Gonçalves, B., Hu, H., Ramasco, J.J., Colizza, V., Vespignani, A.: Modeling the spatial spread of infectious diseases: the GLobal Epidemic and Mobility computational model. J. Comput. Sci. 1(3), 132–145 (2010). https://doi.org/10.1016/j.jocs.2010.07.002
- (2) Knyazev, A.: Recent implementations, applications, and extensions of the locally optimal block preconditioned conjugate gradient method (LOBPCG). arXiv:1708.08354 (2017)
- (3) Saad, Y.: Numerical Methods for Large Eigenvalue Problems: Revised Edition. SIAM, Philadelphia, PA, USA (2011)
- (4) Neymeyr, K.: A geometric theory for preconditioned inverse iteration I: Extrema of the Rayleigh quotient. Linear Algebra Appl. 322(1-3), 61–85 (2001). https://doi.org/10.1016/S0024-3795(00)00239-1
- (5) Neymeyr, K.: A geometric theory for preconditioned inverse iteration applied to a subspace. Math. Comp. 71(237), 197–216 (2002). https://doi.org/10.1090/S0025-5718-01-01357-6
- (6) Argentati, M., Knyazev, A., Neymeyr, K., Ovtchinnikov, E., Zhou, M.: Convergence theory for preconditioned eigenvalue solvers in a nutshell. Found. Comput. Math. 17, 713–727 (2017). https://doi.org/10.1007/s10208-015-9297-1
- (7) Knyazev, A.V.: Toward the optimal preconditioned eigensolver: Locally optimal block preconditioned conjugate gradient method. SIAM J. Sci. Comput. 23(2), 517–541 (2001). https://doi.org/10.1137/S1064827500366124
