A mixed precision LOBPCG algorithm

Daniel Kressner; Yuxin Ma; Meiyue Shao

arXiv:2302.12528·math.NA·May 6, 2024

TL;DR

This paper introduces a mixed precision variant of the LOBPCG algorithm that reduces computation time significantly while maintaining convergence quality, suitable for large Hermitian matrices on CPUs and GPUs.

Contribution

The paper presents a novel mixed precision LOBPCG algorithm with a mixed precision preconditioner and orthogonalization strategy, along with theoretical analysis of its convergence impact.

Findings

01

Reduces computation time by a factor of 1.4–2.0

02

Has only a marginal impact on convergence

03

Effective on both CPUs and GPUs

Abstract

The locally optimal block preconditioned conjugate gradient (LOBPCG) algorithm is a popular approach for computing a few smallest eigenvalues and the corresponding eigenvectors of a large Hermitian positive definite matrix A. In this work, we propose a mixed precision variant of LOBPCG that uses a (sparse) Cholesky factorization of A computed in reduced precision as the preconditioner. To further enhance performance, a mixed precision orthogonalization strategy is proposed. To analyze the impact of reducing precision in the preconditioner on performance, we carry out a rounding error and convergence analysis of PINVIT, a simplified variant of LOBPCG. Our theoretical results predict and our numerical experiments confirm that the impact on convergence remains marginal. In practice, our mixed precision LOBPCG algorithm typically reduces the computation time by a factor of 1.4--2.0 on both…

Tables

Table 1: Libraries used in our implementation.

Setting	Cholesky	TRSM	mat–vec	others
CPU/sparse	CHOLMOD	CHOLMOD	MKL	LAPACK
GPU/sparse	CHOLMOD	cuSPARSE	cuSPARSE	MAGMA
CPU/dense	LAPACK	LAPACK	LAPACK	LAPACK
GPU/dense	MAGMA	MAGMA	MAGMA	MAGMA

Table 2: Run time of Algorithm 2 in seconds.

$(n, k)$	Run time of Algorithm 2	Run time of DGEQRF	Savings
$(40000, 20)$	$1.544 \times 10^{-3}$	$2.078 \times 10^{-3}$	$25.7\%$
$(40000, 25)$	$1.907 \times 10^{-3}$	$2.823 \times 10^{-3}$	$32.5\%$
$(40000, 30)$	$2.371 \times 10^{-3}$	$3.579 \times 10^{-3}$	$33.7\%$
$(40000, 35)$	$3.586 \times 10^{-3}$	$4.886 \times 10^{-3}$	$26.6\%$
$(40000, 40)$	$4.000 \times 10^{-3}$	$5.726 \times 10^{-3}$	$30.1\%$
$(40000, 45)$	$4.495 \times 10^{-3}$	$6.862 \times 10^{-3}$	$34.5\%$

Table 3: Information on the sparse test matrices.

Name	Size	NNZ	Sparsity	NNZ of $L$
obstclae	40,000	197,608	$1.235 \times 10^{-4}$	1,561,880
shallow_water2	81,920	327,680	$4.883 \times 10^{-5}$	3,483,014
Dubcova2	65,025	1,030,225	$2.437 \times 10^{-4}$	3,804,558
Dubcova3	146,689	3,636,643	$1.690 \times 10^{-4}$	7,409,077
finan512	74,752	596,992	$1.068 \times 10^{-4}$	3,376,835
2D-Laplace	25,000	114,990	$1.840 \times 10^{-4}$	466,491

Equations

$$A X = X \Lambda,$$

$$x_{i+1} = x_{i} - T\bigl(A x_{i} - \rho(x_{i}) x_{i}\bigr),$$

$$\tilde{X}_{i+1} = X_{i} - T (A X_{i} - X_{i} \Theta_{i}),$$

$$x_{i+1} = (1 + \beta_{i}) x_{i} + (-\beta_{i}) x_{i-1} + \alpha_{i} T (A - \lambda_{1} I) x_{i},$$

$$x_{i+1} = \alpha_{1}^{(i)} x_{i} + \alpha_{2}^{(i)} x_{i-1} + \alpha_{3}^{(i)} T\bigl(A x_{i} - \rho(x_{i}) x_{i}\bigr),$$

$$X_{i+1} = X_{i} C_{1}^{(i)} + X_{i-1} C_{2}^{(i)} + W_{i} C_{3}^{(i)} = [X_{i}, X_{i-1}, W_{i}] \begin{bmatrix} C_{1}^{(i)} \\ C_{2}^{(i)} \\ C_{3}^{(i)} \end{bmatrix} =: S_{i} C_{i},$$

$$\min_{X_{i+1}^{*} X_{i+1} = I} \operatorname{tr}(X_{i+1}^{*} A X_{i+1}) = \min_{C_{i}^{*} S_{i}^{*} S_{i} C_{i} = I} \operatorname{tr}(C_{i}^{*} S_{i}^{*} A S_{i} C_{i}), \qquad (1)$$

$$x_{i+1} = x_{i} - T\bigl(A x_{i} - \rho(x_{i}) x_{i}\bigr). \qquad (2)$$

$$\operatorname{fl}(A x_{i}) = (A + \Delta A) x_{i} \quad \text{with} \quad \lVert \Delta A \rVert_{2} \leq \epsilon_{A} \lVert A \rVert_{2}, \qquad (3)$$

$$\epsilon_{A} = n \gamma_{n}^{h} \quad \text{with} \quad \gamma_{n}^{h} = \frac{n \bm{u}_{h}}{1 - n \bm{u}_{h}},$$

$$\hat{r}_{i} = (I + E) (r_{i} + F x_{i}),$$

$$\epsilon_{r} = \bigl(\gamma_{n}^{h} + \epsilon_{A} + \gamma_{n}^{h} \epsilon_{A} + (n+1) \bm{u}_{h}\bigr) \frac{1 + \bm{u}_{h}}{1 - 2n \bm{u}_{h}} + \epsilon_{A} + \bm{u}_{h}.$$

$$\bigl\lvert \operatorname{fl}(x_{i}^{\top} A x_{i}) - x_{i}^{\top} \operatorname{fl}(A x_{i}) \bigr\rvert \leq \gamma_{n}^{h} \lVert x_{i} \rVert_{2} \lVert \operatorname{fl}(A x_{i}) \rVert_{2} \leq \gamma_{n}^{h} (1 + \epsilon_{A}) \lVert A \rVert_{2} \lVert x_{i} \rVert_{2}^{2}.$$

$$\begin{aligned}
\bigl\lvert \operatorname{fl}(x_{i}^{\top} A x_{i}) - x_{i}^{\top} A x_{i} \bigr\rvert
&\leq \bigl\lvert \operatorname{fl}(x_{i}^{\top} A x_{i}) - x_{i}^{\top} \operatorname{fl}(A x_{i}) \bigr\rvert + \bigl\lvert x_{i}^{\top} \Delta A \, x_{i} \bigr\rvert \\
&\leq \bigl(\gamma_{n}^{h} (1 + \epsilon_{A}) + \epsilon_{A}\bigr) \lVert A \rVert_{2} \lVert x_{i} \rVert_{2}^{2}.
\end{aligned}$$

$$\begin{aligned}
\bigl\lvert \operatorname{fl}(\rho(x_{i})) - \rho(x_{i}) \bigr\rvert
&\leq \left\lvert \frac{\bigl(\operatorname{fl}(x_{i}^{\top} A x_{i}) - x_{i}^{\top} A x_{i}\bigr) (1 + \delta_{2})}{x_{i}^{\top} x_{i} (1 + \delta_{1})} \right\rvert + \left\lvert \frac{x_{i}^{\top} A x_{i} (\delta_{2} - \delta_{1})}{x_{i}^{\top} x_{i} (1 + \delta_{1})} \right\rvert \\
&\leq \frac{\bigl\lvert \operatorname{fl}(x_{i}^{\top} A x_{i}) - x_{i}^{\top} A x_{i} \bigr\rvert}{x_{i}^{\top} x_{i}} \left\lvert \frac{1 + \delta_{2}}{1 + \delta_{1}} \right\rvert + \rho(x_{i}) \left\lvert \frac{\delta_{2} - \delta_{1}}{1 + \delta_{1}} \right\rvert \\
&\leq \bigl(\gamma_{n}^{h} (1 + \epsilon_{A}) + \epsilon_{A}\bigr) \frac{1}{1 - 2n \bm{u}_{h}} \lVert A \rVert_{2} + \frac{(1 + n) \bm{u}_{h}}{1 - 2n \bm{u}_{h}} \rho(x_{i}) \\
&\leq \bigl(\gamma_{n}^{h} (1 + \epsilon_{A}) + \epsilon_{A} + (n+1) \bm{u}_{h}\bigr) \frac{\lVert A \rVert_{2}}{1 - 2n \bm{u}_{h}}.
\end{aligned}$$

$$\hat{r}_{i} = (I + E) (r_{i} + F x_{i}), \qquad \lVert F \rVert_{2} \leq \Bigl(\bigl(\gamma_{n}^{h} + \epsilon_{A} + \gamma_{n}^{h} \epsilon_{A} + (n+1) \bm{u}_{h}\bigr) \frac{1 + \bm{u}_{h}}{1 - 2n \bm{u}_{h}} + \epsilon_{A} + \bm{u}_{h}\Bigr) \lVert A \rVert_{2}.$$

$$\hat{w}_{i} = T_{E} \hat{r}_{i}, \qquad (5)$$

$$\gamma := \lVert I - A^{1/2} T_{E} A^{1/2} \rVert_{2} + \gamma_{2}^{h} \lVert T_{E} \rVert_{2} \lVert A \rVert_{2} + \beta(x_{i}) \bigl(\bm{u}_{h} + (1 + \gamma_{2}^{h}) \epsilon_{r} \lVert T_{E} \rVert_{2} \lVert A \rVert_{2}\bigr) < 1,$$

$$\beta(x_{i}) = \max\biggl\{\frac{\sqrt{\lambda_{1} \lambda_{n}}}{\rho(x_{i}) - \lambda_{1}}, \frac{\sqrt{\lambda_{2} \lambda_{n}}}{\lambda_{2} - \rho(x_{i})}\biggr\},$$

$$\frac{\rho(\hat{x}_{i+1}) - \lambda_{1}}{\lambda_{2} - \rho(\hat{x}_{i+1})} \leq \Bigl(\gamma + (1 - \gamma) \frac{\lambda_{1}}{\lambda_{2}}\Bigr)^{2} \frac{\rho(x_{i}) - \lambda_{1}}{\lambda_{2} - \rho(x_{i})}.$$

$$\hat{x}_{i+1} = x_{i} - \bigl(\tilde{T}_{E} (A - \rho(x_{i}) I) - E_{0} + \tilde{T}_{E} F\bigr) x_{i},$$

$$\hat{x}_{i+1} = x_{i} - \bigl(\tilde{T}_{E} - E_{0} A_{\rho}^{-1} + \tilde{T}_{E} F A_{\rho}^{-1}\bigr) r_{i},$$

$$\bigl\lVert I - A^{1/2} (\tilde{T}_{E} - E_{0} A_{\rho}^{-1} + \tilde{T}_{E} F A_{\rho}^{-1}) A^{1/2} \bigr\rVert_{2} < 1. \qquad (6)$$

$$\lVert I - A^{1/2} \tilde{T}_{E} A^{1/2} \rVert_{2}$$

Taxonomy

Topics: Matrix Theory and Algorithms · Electromagnetic Scattering and Analysis · Tensor decomposition and applications

Full text

2023

A mixed precision LOBPCG algorithm

Daniel Kressner, Yuxin Ma, Meiyue Shao

These authors contributed equally to this work.

[1] Institute of Mathematics, EPFL, Lausanne, CH-1015, Switzerland

[2] School of Mathematical Sciences, Fudan University, Shanghai, 200433, China

[3] School of Data Science, Fudan University, Shanghai, 200433, China

[4] MOE Key Laboratory for Computational Physical Sciences, Fudan University, Shanghai, 200433, China

Abstract

The locally optimal block preconditioned conjugate gradient (LOBPCG) algorithm is a popular approach for computing a few smallest eigenvalues and the corresponding eigenvectors of a large Hermitian positive definite matrix $A$. In this work, we propose a mixed precision variant of LOBPCG that uses a (sparse) Cholesky factorization of $A$ computed in reduced precision as the preconditioner. To further enhance performance, a mixed precision orthogonalization strategy is proposed. To analyze the impact of reducing precision in the preconditioner on performance, we carry out a rounding error and convergence analysis of PINVIT, a simplified variant of LOBPCG. Our theoretical results predict and our numerical experiments confirm that the impact on convergence remains marginal. In practice, our mixed precision LOBPCG algorithm typically reduces the computation time by a factor of $1.4$–$2.0$ on both CPUs and GPUs.

Keywords: Symmetric eigenvalue problem, LOBPCG algorithm, mixed precision algorithm

MSC Classification: 65F15, 65F50

1 Introduction

Given a large Hermitian positive definite matrix $A\in\mathbb{C}^{n\times n}$, this work considers the computation of the $k$ smallest eigenvalues $0<\lambda_{1}\leq\cdots\leq\lambda_{k}$ and the corresponding eigenvectors $x_{1},\ldots,x_{k}$ satisfying

$$AX=X\Lambda,$$

where $X=[x_{1},\ldots,x_{k}]$ and $\Lambda$ is diagonal with diagonal entries $\lambda_{1},\ldots,\lambda_{k}$. This problem arises in many applications, such as PDEs and optimization problems, electronic structure calculations, and machine learning; see, for example, BGHRCV2010; K2017; S2011.

When a good preconditioner $T$ for $A$ is available, the preconditioned inverse iteration (PINVIT) from N2001-1 is a good candidate for solving such eigenvalue problems. For $k=1$, PINVIT takes the form

$$x_{i+1}=x_{i}-T\bigl(Ax_{i}-\rho(x_{i})x_{i}\bigr),$$

for some starting vector $x_{0}$. Here, $\rho(x)=(x^{*}Ax)/(x^{*}x)$ denotes the Rayleigh quotient, which is also used to approximate the eigenvalue at each iteration. Note that PINVIT with the “ideal” preconditioner $T=A^{-1}$ becomes equivalent to inverse iteration. When computing several ($k>1$) smallest eigenpairs, one chooses a starting matrix $X_{0}\in\mathbb{C}^{n\times m}$ ($m\geq k$) with orthonormal columns, and one step of the block version of PINVIT N2002 takes the form

$$\tilde{X}_{i+1}=X_{i}-T(AX_{i}-X_{i}\Theta_{i}),$$

where $\Theta_{i}=X_{i}^{*}AX_{i}$. The next iterate $X_{i+1}$ is obtained by orthonormalizing the columns of $\tilde{X}_{i+1}$ by, e.g., a QR factorization. Under mild conditions, linear convergence of PINVIT is proven in AKNOZ2017, with a convergence rate depending on the quality of the preconditioner $T$. The locally optimal block preconditioned conjugate gradient (LOBPCG) method K2001 aims at accelerating the convergence of PINVIT by choosing the next iterate optimally from a $3m$-dimensional subspace that contains the current as well as the previous iterate and the preconditioned residual; see Section 2 for more details. LOBPCG converges at least as fast as PINVIT and often significantly faster.
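To make the single-vector iteration concrete, the following is a minimal NumPy sketch of PINVIT for $k=1$, assuming a real symmetric positive definite matrix. The function name `pinvit`, the callback `apply_T`, the stopping rule, and the 1D Laplacian test problem are illustrative assumptions and are not taken from the paper. The iterate is normalized after each step, which does not change the Rayleigh quotient.

```python
import numpy as np

def pinvit(A, apply_T, x0, tol=1e-10, max_iter=500):
    """Single-vector PINVIT: x <- x - T (A x - rho(x) x), then normalize."""
    x = x0 / np.linalg.norm(x0)
    rho = x @ (A @ x)
    for _ in range(max_iter):
        Ax = A @ x
        rho = x @ Ax                      # Rayleigh quotient (x has unit norm)
        r = Ax - rho * x                  # eigenvalue residual
        if np.linalg.norm(r) <= tol * abs(rho):
            break                         # assumed stopping rule, not the paper's
        x = x - apply_T(r)                # preconditioned correction
        x /= np.linalg.norm(x)            # rescaling leaves rho unchanged
    return rho, x

# Hypothetical test problem: 1D Laplacian with a known smallest eigenvalue.
n = 50
A = 2 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)
A_inv = np.linalg.inv(A)                  # "ideal" preconditioner T = A^{-1}
x0 = np.random.default_rng(0).standard_normal(n)
rho, x = pinvit(A, lambda r: A_inv @ r, x0)
lambda_1 = 2 - 2 * np.cos(np.pi / (n + 1))
print(abs(rho - lambda_1))
```

With `apply_T` set to multiplication by $A^{-1}$, each step reduces to a scaled inverse iteration step, matching the remark above about the ideal preconditioner.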

Executing an algorithm in reduced (single) precision on, e.g., a GPU, can be significantly faster than executing it in default working (double) precision. On the other hand, critical applications may require eigenvalues and eigenvectors computed to an accuracy warranted by working precision. In such a scenario the use of mixed precision algorithms can be beneficial; see AABCCD2021; HM2022 for an overview. For example, Carson and Higham CH2018 proposed a general framework for large-scale mixed precision linear system solvers based on iterative refinement. It is highlighted there that a mixed precision algorithm can be twice as fast as a traditional linear system solver by computing the most expensive part, the LU factorization, in reduced precision. For eigenvalue problems, mixed precision algorithms have recently been proposed for computing all eigenvalues and eigenvectors of a dense matrix. This includes the Newton-like iterative refinement methods for symmetric OA2018; OA2019; OA2020 and nonsymmetric BKS2022 eigenvalue problems, as well as a mixed precision one-sided Jacobi SVD algorithm GMS2022. If only a few eigenvalues and eigenvectors are of interest, one could combine mixed precision with classical iterative refinement D1982 for eigenvalue problems, which solves linear systems with the shifted matrix $A-\hat{\lambda}_{i}I$ in order to correct an approximation $\hat{\lambda}_{i}$ of the $i$th eigenvalue. The need for solving several differently shifted linear systems makes such an approach rather expensive.

In this work, we propose mixed precision PINVIT and LOBPCG algorithms that use a (sparse) Cholesky factorization of $A$ computed in reduced precision as preconditioner. This significantly reduces the cost of accurately computing eigenvalues and eigenvectors compared to inverse iteration, which requires carrying out the Cholesky factorization in working precision. On the theoretical side, we carry out a rounding error analysis of PINVIT, which predicts that reducing precision in the preconditioner usually only has a marginal impact on convergence. On the experimental side, we demonstrate for sparse matrices that our mixed precision LOBPCG algorithm results in up to $1.43\times$ speedup on a CPU and $1.67\times$ speedup on a GPU. For dense matrices, the speedups are $1.67\times$ on a CPU and $2.00\times$ on a GPU.

The rest of this paper is organized as follows. In Section 2, we explain the basic ideas of the LOBPCG algorithm. In Section 3, we propose our mixed precision algorithms and give the details of the implementation. The analysis is presented in Section 4, and numerical experiments in Section 5 demonstrate the efficiency of our mixed precision LOBPCG algorithm.

2 LOBPCG algorithm

In this section, we explain the basic idea of the LOBPCG algorithm from K2001. For $k=1$, LOBPCG can be derived from the preconditioned conjugate gradient (PCG) method. PCG applied to the (singular) linear system $(A-\lambda_{1}I)x=0$ with preconditioner $T$ and initial guess $x_{0}$ is a three-term recurrence of the form

$$x_{i+1}=(1+\beta_{i})x_{i}+(-\beta_{i})x_{i-1}+\alpha_{i}T(A-\lambda_{1}I)x_{i},$$

where $\alpha_{i}$, $\beta_{i}$ are chosen to minimize $x_{i+1}^{*}(A-\lambda_{1}I)x_{i+1}$. As the smallest eigenvalue $\lambda_{1}$ is usually unknown, it needs to be replaced by an approximation, the Rayleigh quotient $\rho(x_{i})$, leading to the basic form of LOBPCG:

$$x_{i+1}=\alpha_{1}^{(i)}x_{i}+\alpha_{2}^{(i)}x_{i-1}+\alpha_{3}^{(i)}T\bigl(Ax_{i}-\rho(x_{i})x_{i}\bigr),$$

where $\alpha^{(i)}_{1}$, $\alpha^{(i)}_{2}$, and $\alpha^{(i)}_{3}$ are chosen to minimize $\rho(x_{i+1})$. Note that, unlike PCG, LOBPCG is not a Krylov subspace method in the usual sense because $\rho(x_{i})$ is different in each iteration.

For $k>1$, LOBPCG takes an initial guess $X_{0}\in\mathbb{C}^{n\times m}$ with $m\geq k$, and produces iterates of the form

$$X_{i+1}=X_{i}C_{1}^{(i)}+X_{i-1}C_{2}^{(i)}+W_{i}C_{3}^{(i)}=[X_{i},X_{i-1},W_{i}]\begin{bmatrix}C_{1}^{(i)}\\ C_{2}^{(i)}\\ C_{3}^{(i)}\end{bmatrix}=:S_{i}C_{i},$$

where $W_{i}=T(AX_{i}-X_{i}\Theta_{i})$ with $\Theta_{i}=X_{i}^{*}AX_{i}$. The $3m\times m$ matrix $C_{i}$ is chosen to minimize

$$\min_{X_{i+1}^{*}X_{i+1}=I}\operatorname{tr}(X_{i+1}^{*}AX_{i+1})=\min_{C_{i}^{*}S_{i}^{*}S_{i}C_{i}=I}\operatorname{tr}(C_{i}^{*}S_{i}^{*}AS_{i}C_{i}),\qquad(1)$$

where $\operatorname{tr}(\cdot)$ denotes the trace of a matrix. By the Rayleigh–Ritz method, a solution $C_{i}$ of (1) is obtained from the eigenvectors belonging to the $m$ smallest eigenvalues of the generalized eigenvalue problem $S_{i}^{*}AS_{i}y=\lambda S_{i}^{*}S_{i}y$; see (GV2013, Section 8.7.2) for numerical algorithms.

Let us stress that the actual implementation of LOBPCG is quite different DSYG2018 due to the numerical instability caused by the ill-conditioning of $S_{i}$. In practice, $[X_{i},X_{i-1}]$ can be orthogonalized by an improved Hetmaniuk–Lehoucq trick (DSYG2018, Section 4.2), and then the remaining block, $W_{i}$, also needs to be orthogonalized carefully.
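The Rayleigh–Ritz step described in this section can be sketched in NumPy as follows, for a real symmetric matrix. This is only an illustration under stated assumptions: instead of solving the generalized eigenvalue problem with $S_{i}^{*}S_{i}$ directly, the sketch orthonormalizes $S_{i}$ by a plain QR factorization, which is a simple substitute for the careful orthogonalization of DSYG2018 and not the robust implementation. The names `lobpcg_step` and `apply_T` and the test problem are hypothetical.

```python
import numpy as np

def lobpcg_step(A, apply_T, X, X_prev=None):
    """One block step: Rayleigh-Ritz on span[X, X_prev, W]; X must be orthonormal."""
    m = X.shape[1]
    AX = A @ X
    Theta = X.T @ AX                       # Theta_i = X_i^* A X_i
    W = apply_T(AX - X @ Theta)            # preconditioned block residual W_i
    blocks = [X, W] if X_prev is None else [X, X_prev, W]
    S = np.hstack(blocks)                  # search space S_i
    # Assumption: orthonormalize S by QR so that the generalized problem
    # reduces to a standard symmetric eigenvalue problem.
    Q, _ = np.linalg.qr(S)
    theta, Y = np.linalg.eigh(Q.T @ (A @ Q))
    return Q @ Y[:, :m], theta[:m]         # Ritz vectors and the m smallest Ritz values

# Hypothetical test problem: 1D Laplacian, exact inverse as preconditioner.
n, m = 60, 4
A = 2 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)
A_inv = np.linalg.inv(A)
X, _ = np.linalg.qr(np.random.default_rng(1).standard_normal((n, m)))
X_prev = None
for _ in range(30):
    X_new, theta = lobpcg_step(A, lambda R: A_inv @ R, X, X_prev)
    X_prev, X = X, X_new
exact = 2 - 2 * np.cos(np.arange(1, m + 1) * np.pi / (n + 1))
print(np.abs(theta - exact).max())
```

The first call has no previous iterate, so the search space is only $[X_{0}, W_{0}]$; from the second step on it has $3m$ columns.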

3 Mixed precision algorithms

In this section, we derive a mixed precision LOBPCG algorithm. For this purpose, we consider two precisions: a working precision and a lower/reduced precision, e.g., IEEE double and single precisions. The input and output data of our algorithms are always stored in working precision. The functions $\mathtt{lower}(\cdot)$ and $\mathtt{working}(\cdot)$ are used to convert working precision data into lower precision and vice versa.

3.1 Lower precision preconditioning

The application of the preconditioner $T$ usually consumes a considerable fraction of the computational expense of PINVIT and LOBPCG. This suggests implementing the application of $T$ in lower precision. In most cases, we expect that this only has a small impact on convergence. While a more detailed analysis will be provided in Section 4, the existing convergence analysis of PINVIT already provides a good intuition.

By (AKNOZ2017, Theorem 2.1), PINVIT with $k=1$ converges to the smallest eigenvalue and eigenvector when $\gamma:=\lVert I-A^{1/2}TA^{1/2}\rVert_{2}<1$ and additional mild conditions are satisfied. Asymptotically, the convergence is linear with a rate that is bounded by $\gamma+(1-\gamma)\lambda_{1}/\lambda_{2}$. If $T$ is perturbed by rounding error in lower precision, one effectively applies a preconditioner $T_{E}$, which remains close to $T$. In turn, the convergence is now determined by $\lVert I-A^{1/2}T_{E}A^{1/2}\rVert_{2}$, which remains close to $\gamma$. Unless $\gamma$ is very close to $1$, we thus expect that replacing $T$ by $T_{E}$ does not affect convergence significantly. These considerations lead to Algorithm 1, PINVIT with a lower precision preconditioner.
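Algorithm 1 itself is not reproduced in this text, so the following NumPy sketch only follows the prose above: the residual is formed in double precision, while the preconditioner is a Cholesky factor computed and applied in single precision. The helper name, the fixed iteration count, and the test matrix are assumptions, and `np.linalg.solve` stands in for dedicated triangular solvers for brevity.

```python
import numpy as np

def low_precision_cholesky_preconditioner(A):
    """Return r -> T_E r, where T_E applies A^{-1} via a float32 Cholesky factor."""
    L32 = np.linalg.cholesky(A.astype(np.float32))   # A ~= L32 @ L32.T in single
    def apply_T(r):
        r32 = r.astype(np.float32)                   # lower(r)
        y = np.linalg.solve(L32, r32)                # solve L y = r
        w = np.linalg.solve(L32.T, y)                # solve L^T w = y
        return w.astype(np.float64)                  # working(w)
    return apply_T

# Hypothetical test problem: 1D Laplacian.
n = 50
A = 2 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)
apply_T = low_precision_cholesky_preconditioner(A)

x = np.random.default_rng(0).standard_normal(n)
x /= np.linalg.norm(x)
for _ in range(80):                 # fixed iteration count, chosen for illustration
    Ax = A @ x                      # working precision
    rho = x @ Ax
    r = Ax - rho * x                # residual in working precision
    x = x - apply_T(r)              # only the preconditioner runs in float32
    x /= np.linalg.norm(x)
rho = x @ (A @ x)
lambda_1 = 2 - 2 * np.cos(np.pi / (n + 1))
print(abs(rho - lambda_1))
```

On this small example the computed eigenvalue reaches double precision accuracy even though the preconditioner is only accurate to single precision, in line with the intuition given above.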

3.2 A mixed precision orthogonalization procedure

In both PINVIT and LOBPCG, we need to produce an orthogonal basis of the search subspace in each iteration. Moreover, orthogonalization plays an important role in ensuring numerical stability of the LOBPCG algorithm DSYG2018; HL2006. We need to perform the orthogonalization procedure as accurately as possible. However, orthogonalization is often quite expensive in practice. Therefore, it is desirable to make use of a lower precision to accelerate this procedure.

There are mainly two existing mixed precision algorithms for computing the QR factorization. The algorithm proposed in YTD2015 uses higher precision to compute the inner products in order to enhance the numerical stability of the Cholesky-QR algorithm. The drawback is that this algorithm can be much slower than the standard Cholesky-QR algorithm if higher precision arithmetic lacks hardware support. To improve the performance, a mixed precision block Gram–Schmidt orthogonalization algorithm was proposed in YTKDB2015. For both algorithms the orthogonality of the output depends linearly on the condition number of the input.

We propose another mixed precision approach for orthogonalization. We first use Householder-QR to factorize $\mathtt{lower}(W_{i})=Q_{\mathtt{lower}}R_{\mathtt{lower}}$ in lower precision. Then $\mathtt{working}(R_{\mathtt{lower}})$ is used as a preconditioner: we apply Cholesky-QR to the preconditioned matrix $W_{i}\cdot\mathtt{working}(R_{\mathtt{lower}}^{-1})$ to refine the orthogonality. Under mild assumptions, $W_{i}\cdot\mathtt{working}(R_{\mathtt{lower}}^{-1})$ is reasonably well-conditioned, so that the Cholesky-QR algorithm is sufficiently accurate. This mixed precision QR factorization algorithm is summarized in Algorithm 2.
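Since Algorithm 2 is not shown in this text, the sketch below only mirrors the three steps of the prose description for a real tall-skinny matrix: Householder QR in single precision, preconditioning with the resulting triangular factor in double precision, and Cholesky-QR on the preconditioned matrix. The function name `mixed_precision_qr` and the test matrix are assumptions, and general solvers are used in place of triangular ones.

```python
import numpy as np

def mixed_precision_qr(W):
    """Sketch of a mixed precision QR: returns Q (orthonormal) and R with W ~= Q R."""
    # Step 1: Householder QR of lower(W) in float32; only R is needed.
    R_low = np.linalg.qr(W.astype(np.float32), mode="r")
    R1 = R_low.astype(np.float64)                 # working(R_lower)
    # Step 2: precondition in working precision, V = W R1^{-1}.
    V = np.linalg.solve(R1.T, W.T).T
    # Step 3: Cholesky-QR on V, which should be well-conditioned.
    G = V.T @ V                                   # Gram matrix
    L = np.linalg.cholesky(G)                     # G = L L^T, so V = Q L^T
    Q = np.linalg.solve(L, V.T).T                 # Q = V L^{-T}
    R = L.T @ R1                                  # W = V R1 = Q (L^T R1)
    return Q, R

# Hypothetical tall-skinny test matrix with condition number around 1e3.
rng = np.random.default_rng(0)
W = rng.standard_normal((500, 10)) @ np.diag(np.logspace(0, -3, 10))
Q, R = mixed_precision_qr(W)
print(np.linalg.norm(Q.T @ Q - np.eye(10)), np.linalg.norm(Q @ R - W))
```

The single precision factor only has to capture the bulk of the conditioning of $W_{i}$; the final orthogonality comes from the Cholesky-QR step carried out in working precision.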

3.3 A mixed precision LOBPCG algorithm

In addition to preconditioning and orthogonalization, the application of $A$ and other parts of PINVIT and LOBPCG may also constitute nonnegligible expenses, depending on the specific setting. Carrying out these parts in lower precision bears the risk of limiting the attainable accuracy to lower precision. However, very often it is still possible to further exploit lower precision arithmetic.

As PINVIT and LOBPCG converge linearly in general, we can break the computation into two stages as follows. In the first stage, we perform all computations in lower precision to produce an approximate solution in lower precision. In the second stage, we switch back to working precision, using the approximate solution as an initial guess and applying lower precision preconditioning. In this manner we are able to obtain a satisfactory solution in working precision by making use of lower precision arithmetic as much as possible.

In summary, we compute a good initial guess in lower precision, and then refine the solution using the LOBPCG algorithm in working precision. Lower precision is exploited in both preconditioning and orthogonalization in the LOBPCG algorithm. The resulting mixed precision LOBPCG algorithm is summarized in Algorithm 3.
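The two-stage idea can be illustrated with a short NumPy sketch. To keep it small, it uses single-vector PINVIT in place of LOBPCG and omits the mixed precision orthogonalization, so it is not Algorithm 3. The function name, the stopping rule $\lVert r\rVert_{2}\leq\mathtt{tol}\,\lVert A\rVert_{2}$, and the test matrix are assumptions; the two tolerances are the values quoted in Section 5.1.

```python
import numpy as np

def pinvit(A, L32, x, dtype, tol, max_iter=200):
    """PINVIT carried out in precision `dtype` with a float32 Cholesky preconditioner."""
    A = A.astype(dtype)
    x = x.astype(dtype)
    x /= np.linalg.norm(x)
    norm_A = np.linalg.norm(A, 2)
    for it in range(max_iter):
        Ax = A @ x
        rho = x @ Ax
        r = Ax - rho * x
        if np.linalg.norm(r) <= tol * norm_A:     # assumed stopping rule
            break
        y = np.linalg.solve(L32, r.astype(np.float32))
        w = np.linalg.solve(L32.T, y).astype(dtype)
        x = x - w
        x /= np.linalg.norm(x)
    return rho, x, it

# Hypothetical test problem: 1D Laplacian.
n = 50
A = 2 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)
L32 = np.linalg.cholesky(A.astype(np.float32))
x0 = np.random.default_rng(0).standard_normal(n)

# Stage 1: everything in single precision, loose tolerance.
_, x_low, it_low = pinvit(A, L32, x0, np.float32, tol=5e-6)
# Stage 2: working precision, single precision preconditioner, tight tolerance.
rho, x, it_work = pinvit(A, L32, x_low, np.float64, tol=1e-12)

lambda_1 = 2 - 2 * np.cos(np.pi / (n + 1))
print(it_low, it_work, abs(rho - lambda_1))
```

The second stage starts from an iterate that is already accurate to roughly single precision, so it only has to recover the remaining digits.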

4 Convergence in finite-precision arithmetic

In our experiments, we observe that rounding error does not significantly affect the convergence of Algorithms 1 and 3 until an accuracy on the level of working precision is reached. To gain theoretical insights into this observation, we study the effect of rounding error on PINVIT for $k=1$:

$$x_{i+1}=x_{i}-T\bigl(Ax_{i}-\rho(x_{i})x_{i}\bigr).\qquad(2)$$

For simplicity, we consider real matrices, that is, $A\in\mathbb{R}^{n\times n}$ is positive definite with eigenvalues $0<\lambda_{1}<\lambda_{2}\leq\dotsb\leq\lambda_{n}$. Moreover, we assume that $n^{-1}$ is far larger than the unit roundoff, even in reduced precision.

In analyzing the effect of rounding error on (2), we assume that the computed matrix–vector product $\operatorname{fl}(Ax_{i})$ satisfies the backward error bound

$$\operatorname{fl}(Ax_{i})=(A+\Delta A)x_{i}\quad\text{with}\quad\lVert\Delta A\rVert_{2}\leq\epsilon_{A}\lVert A\rVert_{2},\qquad(3)$$

for some symmetric $\Delta A$ (depending on $x_{i}$). When carrying out standard matrix–vector multiplication with a dense or sparse matrix $A$, Lemma 6.6 in H2002 states that (3) holds with

$$\epsilon_{A}=n\gamma_{n}^{h}\quad\text{with}\quad\gamma_{n}^{h}=\frac{n\bm{u}_{h}}{1-n\bm{u}_{h}},$$

where $\bm{u}_{h}$ denotes the unit roundoff in working precision.
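For a sense of scale, the constant $\gamma_{n}=n\bm{u}/(1-n\bm{u})$ can be evaluated numerically. The sketch below assumes IEEE double and single precision with unit roundoffs $2^{-53}$ and $2^{-24}$; the helper name `gamma` and the sample values of $n$ are illustrative.

```python
import numpy as np

def gamma(n, u):
    """gamma_n = n u / (1 - n u), defined only when n u < 1."""
    if n * u >= 1:
        raise ValueError("gamma_n requires n * u < 1")
    return n * u / (1 - n * u)

# Unit roundoff is half the machine epsilon reported by NumPy.
u_double = float(np.finfo(np.float64).eps) / 2   # 2**-53
u_single = float(np.finfo(np.float32).eps) / 2   # 2**-24

for n in (10**3, 10**5):
    print(n, gamma(n, u_double), gamma(n, u_single))
```

The requirement $n\bm{u}<1$ in the helper corresponds to the assumption above that $n^{-1}$ is far larger than the unit roundoff.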

Lemma 1.

*Let $\hat{r}_{i}$ denote the result of evaluating $r_{i}:=Ax_{i}-\rho(x_{i})x_{i}$ in working precision. Assuming that (3) holds, there exist a symmetric matrix $F\in\mathbb{R}^{n\times n}$ and a diagonal matrix $E\in\mathbb{R}^{n\times n}$ such that*

$$\hat{r}_{i}=(I+E)(r_{i}+Fx_{i}),$$

*where $\lVert E\rVert_{2}\leq\bm{u}_{h}$ and $\lVert F\rVert_{2}\leq\epsilon_{r}\lVert A\rVert_{2}$ with*

$$\epsilon_{r}=\bigl(\gamma_{n}^{h}+\epsilon_{A}+\gamma_{n}^{h}\epsilon_{A}+(n+1)\bm{u}_{h}\bigr)\frac{1+\bm{u}_{h}}{1-2n\bm{u}_{h}}+\epsilon_{A}+\bm{u}_{h}.$$

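**Proof:** We first analyze the rounding error when forming $\rho(x_{i})$. From (H2002, Equation (3.5)) and (3), we obtain

$$\bigl\lvert\operatorname{fl}(x_{i}^{\top}Ax_{i})-x_{i}^{\top}\operatorname{fl}(Ax_{i})\bigr\rvert\leq\gamma_{n}^{h}\lVert x_{i}\rVert_{2}\lVert\operatorname{fl}(Ax_{i})\rVert_{2}\leq\gamma_{n}^{h}(1+\epsilon_{A})\lVert A\rVert_{2}\lVert x_{i}\rVert_{2}^{2}.$$

Thus, we have

$$\begin{aligned}
\bigl\lvert\operatorname{fl}(x_{i}^{\top}Ax_{i})-x_{i}^{\top}Ax_{i}\bigr\rvert
&\leq\bigl\lvert\operatorname{fl}(x_{i}^{\top}Ax_{i})-x_{i}^{\top}\operatorname{fl}(Ax_{i})\bigr\rvert+\bigl\lvert x_{i}^{\top}\Delta A\,x_{i}\bigr\rvert\\
&\leq\bigl(\gamma_{n}^{h}(1+\epsilon_{A})+\epsilon_{A}\bigr)\lVert A\rVert_{2}\lVert x_{i}\rVert_{2}^{2}.
\end{aligned}$$

Combined with $\operatorname{fl}(x_{i}^{\top}x_{i})=x_{i}^{\top}x_{i}(1+\delta_{1})$ for $\lvert\delta_{1}\rvert\leq\gamma_{n}^{h}$, this implies for $\rho(x_{i})=x_{i}^{\top}Ax_{i}/(x_{i}^{\top}x_{i})$ that there is $\lvert\delta_{2}\rvert\leq\bm{u}_{h}$ such that

$$\begin{aligned}
\bigl\lvert\operatorname{fl}(\rho(x_{i}))-\rho(x_{i})\bigr\rvert
&\leq\left\lvert\frac{\bigl(\operatorname{fl}(x_{i}^{\top}Ax_{i})-x_{i}^{\top}Ax_{i}\bigr)(1+\delta_{2})}{x_{i}^{\top}x_{i}(1+\delta_{1})}\right\rvert+\left\lvert\frac{x_{i}^{\top}Ax_{i}(\delta_{2}-\delta_{1})}{x_{i}^{\top}x_{i}(1+\delta_{1})}\right\rvert\\
&\leq\frac{\bigl\lvert\operatorname{fl}(x_{i}^{\top}Ax_{i})-x_{i}^{\top}Ax_{i}\bigr\rvert}{x_{i}^{\top}x_{i}}\left\lvert\frac{1+\delta_{2}}{1+\delta_{1}}\right\rvert+\rho(x_{i})\left\lvert\frac{\delta_{2}-\delta_{1}}{1+\delta_{1}}\right\rvert\\
&\leq\bigl(\gamma_{n}^{h}(1+\epsilon_{A})+\epsilon_{A}\bigr)\frac{1}{1-2n\bm{u}_{h}}\lVert A\rVert_{2}+\frac{(1+n)\bm{u}_{h}}{1-2n\bm{u}_{h}}\rho(x_{i})\\
&\leq\bigl(\gamma_{n}^{h}(1+\epsilon_{A})+\epsilon_{A}+(n+1)\bm{u}_{h}\bigr)\frac{\lVert A\rVert_{2}}{1-2n\bm{u}_{h}}.
\end{aligned}$$

The vector subtraction and scaling when forming $r_{i}=Ax_{i}-\rho(x_{i})x_{i}$ yield two diagonal matrices $E$ and $E_{1}$ such that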

[TABLE]

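where $\lVert E\rVert_{2}\leq\bm{u}_{h}$ and $\lVert E_{1}\rVert_{2}\leq\bm{u}_{h}$. Combined with (4), this concludes the proof because

$$\hat{r}_{i}=(I+E)(r_{i}+Fx_{i}),\qquad\lVert F\rVert_{2}\leq\Bigl(\bigl(\gamma_{n}^{h}+\epsilon_{A}+\gamma_{n}^{h}\epsilon_{A}+(n+1)\bm{u}_{h}\bigr)\frac{1+\bm{u}_{h}}{1-2n\bm{u}_{h}}+\epsilon_{A}+\bm{u}_{h}\Bigr)\lVert A\rVert_{2}.$$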

We model the inexact application of the preconditioner $T$ to $\hat{r}_{i}$ in the iteration (2) with the equation

$$\hat{w}_{i}=T_{E}\hat{r}_{i},\qquad(5)$$

where $T_{E}$ depends on the choice of preconditioner $T$ and on the way $T\hat{r}_{i}$ is computed. Note that $T_{E}$ also depends on $i$.

Theorem 2.

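*Consider the setting of Lemma 1 and (5). If $\lambda_{1}<\rho(x_{i})<\lambda_{2}$ and*

$$\gamma:=\lVert I-A^{1/2}T_{E}A^{1/2}\rVert_{2}+\gamma_{2}^{h}\lVert T_{E}\rVert_{2}\lVert A\rVert_{2}+\beta(x_{i})\bigl(\bm{u}_{h}+(1+\gamma_{2}^{h})\epsilon_{r}\lVert T_{E}\rVert_{2}\lVert A\rVert_{2}\bigr)<1,$$

*with*

$$\beta(x_{i})=\max\biggl\{\frac{\sqrt{\lambda_{1}\lambda_{n}}}{\rho(x_{i})-\lambda_{1}},\frac{\sqrt{\lambda_{2}\lambda_{n}}}{\lambda_{2}-\rho(x_{i})}\biggr\},$$

*then the computed result $\hat{x}_{i+1}$ of the PINVIT iteration (2) satisfies*

$$\frac{\rho(\hat{x}_{i+1})-\lambda_{1}}{\lambda_{2}-\rho(\hat{x}_{i+1})}\leq\Bigl(\gamma+(1-\gamma)\frac{\lambda_{1}}{\lambda_{2}}\Bigr)^{2}\frac{\rho(x_{i})-\lambda_{1}}{\lambda_{2}-\rho(x_{i})}.$$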

**Proof:** By (2), (5), and Lemma 1, there exists a diagonal matrix $E_{0}$ (coming from the vector addition) such that $\lVert E_{0}\rVert\leq\bm{u}_{h}$ and

$$\hat{x}_{i+1}=x_{i}-\bigl(\tilde{T}_{E}(A-\rho(x_{i})I)-E_{0}+\tilde{T}_{E}F\bigr)x_{i},$$

where $\tilde{T}_{E}=(I+E_{0})T_{E}(I+E)$. Setting $A_{\rho}=A-\rho(x_{i})I$ and using that $x_{i}=A_{\rho}^{-1}r_{i}$, it follows that

$$\hat{x}_{i+1}=x_{i}-\bigl(\tilde{T}_{E}-E_{0}A_{\rho}^{-1}+\tilde{T}_{E}FA_{\rho}^{-1}\bigr)r_{i},$$

which takes the form of PINVIT with a perturbed preconditioner. This allows us to apply (AKNOZ2017, Theorem 2.1), which requires the preconditioner to satisfy

$$\bigl\lVert I-A^{1/2}(\tilde{T}_{E}-E_{0}A_{\rho}^{-1}+\tilde{T}_{E}FA_{\rho}^{-1})A^{1/2}\bigr\rVert_{2}<1.\qquad(6)$$

We now treat the different terms involved in (6) separately. First, we have

[TABLE]

By the assumptions, the spectral radius of $A_{\rho}^{-1}A^{1/2}$ is given by

[TABLE]

This allows us to bound the other terms in (6) as follows:

[TABLE]

Together with (7), this implies that the left-hand side of (6) is bounded by $\gamma<1$, and the statement of the theorem follows from (AKNOZ2017, Theorem 2.1).

*Remark.*

We remark that the conclusion of Theorem 2 does not imply that $\rho(x_{i})-\lambda_{1}$ can eventually drop below machine precision. For the relative error $\bigl(\rho(x_{i})-\lambda_{1}\bigr)/\bigl(\lambda_{2}-\rho(x_{i})\bigr)$ to be reduced by the factor

$$\Bigl(\gamma+(1-\gamma)\frac{\lambda_{1}}{\lambda_{2}}\Bigr)^{2}$$

during the $i$th iteration, Theorem 2 requires that

[TABLE]

holds. This reduction takes place until a Rayleigh quotient $\rho(\hat{x})$ for an iterate $\hat{x}$ is produced for which

[TABLE]

For reasonable choices of $T_{E}$ , this means that the error is reduced until it reaches the level of working precision.
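To see why a small increase of $\gamma$ has only a marginal effect, one can evaluate the reduction factor $\bigl(\gamma+(1-\gamma)\lambda_{1}/\lambda_{2}\bigr)^{2}$ from Theorem 2 for sample values. The numbers below ($\lambda_{1}/\lambda_{2}=1/4$, $\gamma=0$ versus $\gamma=0.01$) are hypothetical and chosen only for illustration; $\gamma=0$ is the idealized case of an exact inverse applied without rounding error.

```python
def contraction_factor(gamma, lam1, lam2):
    """Per-step bound (gamma + (1 - gamma) * lam1 / lam2)**2 from Theorem 2."""
    if not (0 <= gamma < 1 and 0 < lam1 < lam2):
        raise ValueError("need 0 <= gamma < 1 and 0 < lam1 < lam2")
    return (gamma + (1 - gamma) * lam1 / lam2) ** 2

idealized = contraction_factor(0.0, 1.0, 4.0)    # gamma = 0
perturbed = contraction_factor(0.01, 1.0, 4.0)   # slightly degraded preconditioner
print(idealized, perturbed)
```

In this example the bound changes from 0.0625 to about 0.0663, so the predicted number of iterations is nearly unchanged.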

The quantity $\lVert I-A^{1/2}T_{E}A^{1/2}\rVert_{2}$ critically determines the convergence rate of PINVIT. The following lemma provides an estimate if $T_{E}$ corresponds to applying $A^{-1}$ in low precision via the Cholesky factorization.

Lemma 3.

*Suppose that the application of the preconditioner $T$ in one step of PINVIT (2) is implemented by applying $A^{-1}$ in low precision, via performing the Cholesky factorization of $A$ followed by forward and backward substitution. If $\epsilon_{T}:=4n(3n+1)\kappa(A)\bm{u}_{l}<1$, where $\kappa(A)=\lVert A\rVert_{2}\lVert A^{-1}\rVert_{2}$ and $\bm{u}_{l}$ denotes the unit roundoff in low precision, then*

[TABLE]

**Proof:** Using (H2002, Theorem 10.4), there exists a symmetric matrix $E_{0}$ such that

[TABLE]

which means $T_{E}=(A+E_{0})^{-1}$ and, moreover,

[TABLE]

Then by $\bigl\lVert A^{-1/2}E_{0}A^{-1/2}\bigr\rVert_{2}\leq 4n(3n+1)\kappa(A)\bm{u}_{l}<1$, we have

[TABLE]

Thus, it holds that

[TABLE]

5 Numerical experiments

In this section, we present numerical results for our mixed precision LOBPCG algorithm. In our tests, the working precision is IEEE double precision and the lower precision is IEEE single precision. Most tests are performed on a Linux server equipped with two twelve-core Intel Xeon E5-2670 v3 2.30 GHz CPUs and two Nvidia GeForce GTX 1080 GPUs. The tests in Section 5.5 also use an Nvidia A30 GPU. There are 128 GB of main memory on the CPUs and 11,178.6 MB of main memory on each GPU. Our program uses only one GPU and one thread on the CPU.

5.1 Experiment settings

In our experiments we compute a few smallest eigenvalues and the corresponding eigenvectors of Hermitian matrices using the LOBPCG algorithm. The following variants of the LOBPCG algorithm are tested:

1. DLOBPCG-dchol: LOBPCG algorithm performed entirely in double precision.

2. DLOBPCG-schol: LOBPCG algorithm performed in double precision, except for single precision preconditioning.

3. MPLOBPCG-schol: mixed precision LOBPCG algorithm (Algorithm 3) with single precision preconditioning and an initial guess computed by the single precision LOBPCG algorithm; mixed precision orthogonalization (Algorithm 2) is also used.

When computing $k$ eigenpairs, we run the LOBPCG algorithm with a block size that is about $50\%$ larger than $k$ in order to enhance robustness. The algorithm terminates once the $k$ smallest eigenvalues and the corresponding eigenvectors converge. The convergence criterion is

[TABLE]

where $\lVert A\rVert_{2}$ is estimated through $\lVert A\rVert_{2}\approx\lVert\Omega A\rVert_{\mathsf{F}}/\lVert\Omega\rVert_{\mathsf{F}}$ using a Gaussian random matrix $\Omega\in\mathbb{C}^{m\times n}$ with $m\ll n$. The threshold $\mathtt{tol}$ in (8) is set to $10^{-12}$ for all three algorithms, and to $5\times 10^{-6}$ when computing a good initial guess for MPLOBPCG-schol.
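The randomized norm estimate $\lVert\Omega A\rVert_{\mathsf{F}}/\lVert\Omega\rVert_{\mathsf{F}}$ takes only a few lines in NumPy. The sketch below uses a real Gaussian $\Omega$ and a small real test matrix, whereas the text allows a complex $\Omega$; the function name, the default $m=10$, and the seed are assumptions.

```python
import numpy as np

def estimate_norm2(A, m=10, seed=0):
    """Estimate ||A||_2 by ||Omega A||_F / ||Omega||_F with an m-by-n Gaussian Omega."""
    rng = np.random.default_rng(seed)
    Omega = rng.standard_normal((m, A.shape[0]))   # real Gaussian sketch, m << n
    return np.linalg.norm(Omega @ A, "fro") / np.linalg.norm(Omega, "fro")

# Hypothetical test problem: 1D Laplacian.
n = 50
A = 2 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)
estimate = estimate_norm2(A)
exact = np.linalg.norm(A, 2)                        # reference value via SVD
print(estimate, exact)
```

The estimate costs $m$ products with $A$ and is only meant to fix the scale in the stopping criterion. Since the expected values of the squared numerator and denominator are $m\lVert A\rVert_{\mathsf{F}}^{2}$ and $mn$, the estimate is typically close to $\lVert A\rVert_{\mathsf{F}}/\sqrt{n}$, which does not exceed $\lVert A\rVert_{2}$.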

In our tests, we use $\Pi^{-*}L^{-*}L^{-1}\Pi^{-1}$ as the preconditioner for Algorithm 3, where $\Pi$ is a permutation matrix and $L$ is the (pivoted) Cholesky factor of $A$ satisfying $\Pi^{*}A\Pi=LL^{*}$, computed in single precision. The preconditioning stage in DLOBPCG-schol/MPLOBPCG-schol computes $W_{\mathtt{lower}}=\Pi L^{-*}L^{-1}\Pi^{*}\mathtt{lower}(R)$ by solving two triangular systems in single precision. In practice, we apply $\Pi$ to the given matrix $A$ once instead of applying $\Pi$ to $\mathtt{lower}(R)$ in each iteration. This pays off if $A$ is sufficiently sparse or the convergence of LOBPCG is not too rapid (i.e., it takes many iterations to converge).

We test the LOBPCG algorithm on both sparse and dense matrices. Table 1 summarizes the software libraries used under different settings. The CHOLMOD package [CDHR2008] can compute sparse Cholesky factorizations on both CPU and GPU, while triangular solves are supported only on CPU. Note that CHOLMOD was developed only for double precision arithmetic; we have derived a single precision version for the purpose of our tests.

5.2 Advantage of mixed precision orthogonalization

Before discussing the LOBPCG algorithm, we first report the run time and savings of the mixed precision orthogonalization algorithm (i.e., Algorithm 2) in Table 2. We can see that for tall-skinny matrices Algorithm 2 reduces the run time by about $1/4$–$1/3$ compared to DGEQRF in cuSOLVER. Thus, it is worth using this mixed precision approach for orthogonalization.
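To illustrate why mixing precisions can help in orthogonalization, the following NumPy sketch uses a single precision QR factorization as a preconditioner for a Cholesky-QR pass in double precision. Algorithm 2 is not reproduced in this section, so this is a generic scheme of that flavor and may differ from Algorithm 2 in its details; the function name `mixed_precision_orth` is hypothetical, and the sketch handles real matrices only.

```python
import numpy as np

def mixed_precision_orth(X):
    """Orthonormalize the columns of a real tall-skinny matrix X.

    Generic sketch (not necessarily Algorithm 2): single precision QR as a
    preconditioner, then one Cholesky-QR pass in double precision.
    """
    # Step 1: R factor of a QR factorization computed in single precision.
    R_low = np.linalg.qr(X.astype(np.float32), mode="r")
    R1 = R_low.astype(np.float64)
    # Step 2: Y = X R1^{-1} in double precision; Y is close to orthonormal.
    Y = np.linalg.solve(R1.T, X.T).T
    # Step 3: Cholesky-QR on the well-conditioned Y in double precision.
    G = Y.T @ Y
    L = np.linalg.cholesky(G)
    Q = np.linalg.solve(L, Y.T).T          # Q = Y L^{-T}
    return Q

rng = np.random.default_rng(0)
X = rng.standard_normal((2000, 20))
Q = mixed_precision_orth(X)
orth_err = np.linalg.norm(Q.T @ Q - np.eye(20))
span_err = np.linalg.norm(Q @ (Q.T @ X) - X) / np.linalg.norm(X)
print(orth_err, span_err)
```

The expensive factorization of the tall matrix runs in single precision, while the double precision work consists of matrix products and small $k\times k$ factorizations, which is where the savings in such schemes typically come from.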

5.3 Tests for sparse matrices

We choose six sparse positive definite matrices from the SuiteSparse Matrix Collection (https://sparse.tamu.edu). Table 3 shows the information of these sparse matrices. We compute $30$ eigenpairs using a randomly generated initial guess with $45$ columns for each matrix, and report the relative run time, which is the ratio of the wall clock time of a solver over the wall clock time of DLOBPCG-dchol.

Figures 1 and 2, respectively, show the relative run time on CPU and GPU. For all test cases, preconditioning in single precision reduces the execution time of the LOBPCG algorithm. Using an initial guess computed by the single precision LOBPCG algorithm and adopting mixed precision orthogonalization makes the algorithm even more efficient. Compared to DLOBPCG-dchol, MPLOBPCG-schol is about $1.43\times$ faster on CPU, and about $1.67\times$ faster on GPU.

We should also mention that the numbers of iterations for the different variants of the LOBPCG algorithm are similar, though they are not shown in the figures. Sometimes MPLOBPCG-schol requires fewer iterations to converge because there is a restart when we use the lower precision result as the initial guess. For instance, the total numbers of iterations of DLOBPCG-dchol, DLOBPCG-schol and MPLOBPCG-schol are $533$, $534$, and $461$, respectively, for the 2D-Laplace matrix.

5.4 Tests for dense matrices

We also test the LOBPCG algorithm on a few dense matrices that are popular in machine learning. These dense matrices are kernel matrices generated by certain kernel functions as follows. Let $x_{1}$, $x_{2}$, $\dotsc$, $x_{n}\in\mathbb{R}^{n}$ be uniform random vectors generated by $\mathtt{XLARNV}$ from LAPACK. We construct a matrix $K$ by applying the Gaussian kernel function

[TABLE]

Similarly, we can apply the polynomial kernel function

[TABLE]

to construct another kernel matrix. Using two sets of random vectors $\{x_{1},x_{2},\dotsc,x_{n}\}$ and $\{y_{1},y_{2},\dotsc,y_{n}\}$ in $\mathbb{R}^{n}$, we also construct complex kernel matrices through

[TABLE]

where $k(\cdot,\cdot)$ is either the Gaussian kernel function or the polynomial kernel function.
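As an illustration of how such a real kernel matrix can be assembled, the sketch below uses the standard Gaussian kernel $\exp(-\lVert x-y\rVert_{2}^{2}/(2\sigma^{2}))$. The exact kernel parameters used in the experiments are not restated here, so the bandwidth `sigma`, the problem size, and the use of NumPy's random generator in place of the LAPACK routine are all assumptions made for this example; the complex construction is not covered.

```python
import numpy as np

def gaussian_kernel_matrix(X, sigma=1.0):
    """Gaussian kernel matrix for the rows of X.

    Standard form exp(-||x_i - x_j||^2 / (2 sigma^2)); sigma is a
    hypothetical bandwidth, not a value taken from the paper.
    """
    sq = np.sum(X * X, axis=1)
    D2 = sq[:, None] + sq[None, :] - 2.0 * (X @ X.T)  # squared distances
    np.maximum(D2, 0.0, out=D2)                       # clip rounding noise
    np.fill_diagonal(D2, 0.0)                         # exact zero self-distance
    K = np.exp(-D2 / (2.0 * sigma**2))
    return 0.5 * (K + K.T)                            # enforce exact symmetry

rng = np.random.default_rng(0)
n = 64
X = rng.random((n, n))        # n uniform random vectors in R^n, stored as rows
K = gaussian_kernel_matrix(X)
lam_min = np.linalg.eigvalsh(K).min()
print(lam_min)
```

The resulting matrix is symmetric with unit diagonal and, for distinct points, positive definite, which is the setting required by the LOBPCG variants tested here.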

We choose $n\in\{1024,2048,4096,8192\}$ in our experiments, and compute the $5n/1024$ smallest eigenvalues and the corresponding eigenvectors. The rank of the initial guess is chosen as $8n/1024$ accordingly. Figures 3, 4, and 5 show the relative run time of different variants of the LOBPCG algorithm. For real matrices, MPLOBPCG-schol achieves $1.67\times$ and $2\times$ speedup compared to DLOBPCG-dchol on CPU and GPU, respectively. The speedup is higher than that for sparse matrices, because dense matrices are more compute-intensive. The benefit of mixed precision approaches is more significant for complex matrices: the speedup becomes over $2.5\times$ and up to $5\times$ on GPU.

5.5 Tests on different GPUs

So far, our tests have been performed on an Nvidia GeForce GTX 1080 GPU, which is a consumer-grade GPU. There are two different types of GPU: consumer-grade and server-grade. Compared to consumer-grade GPUs, server-grade GPUs usually have better hardware support for double precision arithmetic. Hence the performance difference between single and double precision arithmetic is larger on consumer-grade GPUs.

In the following we report some results collected from runs on an Nvidia A30 GPU, which is a server-grade one. We use the matrix 2D-Laplace in this test. By perturbing off-diagonal entries of this matrix by $\pm 10^{-16}\cdot\mathrm{i}$, we also obtain a Hermitian positive definite matrix for testing complex arithmetic. From Figure 6, it can be seen that single precision has limited advantage over double precision on this server-grade GPU. Though MPLOBPCG-schol still achieves about $1.3\times$ speedup compared to DLOBPCG-dchol, the benefit of adopting single precision arithmetic is much lower than that on the Nvidia GeForce GTX 1080, which is a consumer-grade GPU.
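The complex test matrix can be illustrated with a small dense sketch. The grid size, the five-point stencil, the choice to perturb only the nonzero off-diagonal entries, and the random sign pattern are assumptions made for this example; the text above only states that off-diagonal entries are perturbed by $\pm 10^{-16}\cdot\mathrm{i}$. The helper names `laplace_2d` and `perturb_hermitian` are hypothetical.

```python
import numpy as np

def laplace_2d(p):
    """Dense five-point finite-difference Laplacian on a p x p grid (order p^2)."""
    T = 2.0 * np.eye(p) - np.eye(p, k=1) - np.eye(p, k=-1)
    I = np.eye(p)
    return np.kron(I, T) + np.kron(T, I)

def perturb_hermitian(A, eps=1e-16, seed=0):
    """Add +/- eps * i to the nonzero off-diagonal entries, keeping A Hermitian.

    The sign pattern is random (an assumption); entry (j, i) receives the
    conjugate of the perturbation applied to entry (i, j).
    """
    rng = np.random.default_rng(seed)
    n = A.shape[0]
    upper = np.triu(A != 0, k=1)             # nonzeros strictly above the diagonal
    signs = rng.choice([-1.0, 1.0], size=(n, n))
    P = np.where(upper, signs * eps, 0.0)    # imaginary parts above the diagonal
    return A.astype(np.complex128) + 1j * (P - P.T)

A = laplace_2d(4)
H = perturb_hermitian(A)
lam_min = np.linalg.eigvalsh(H).min()
print(H.dtype, lam_min)
```

The perturbation is far too small to change the spectrum noticeably, so the matrix exercises the complex arithmetic code paths while remaining essentially the same eigenvalue problem.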

6 Conclusion

In this paper, we have proposed a mixed precision LOBPCG algorithm with a preconditioner based on a (sparse) Cholesky factorization. Both the initial guess and the preconditioner are computed in reduced precision. This largely improves the performance while it only has marginal impact on convergence. In our mixed precision LOBPCG algorithm, orthogonalization is also performed in a mixed precision manner to further improve performance. We analyze the rounding error of the PINVIT algorithm, which can be viewed as a simplified version of the LOBPCG algorithm, to confirm that our mixed precision algorithm is as accurate as the fixed precision one. Numerical experiments illustrate that adopting mixed precision arithmetic can significantly accelerate the execution of the LOBPCG algorithm on both CPUs and GPUs.

Acknowledgments

The authors thank Erin Carson for helpful discussions. Part of this work was performed when the second author was visiting EPF Lausanne in 2022.

Yuxin Ma is partially supported by the State Scholarship Fund of China Scholarship Council (CSC) under Grant No. 202106100093, National Key R&D Program of China under Grant No. 2021YFA1003305 and National Natural Science Foundation of China under Grant No. 71991471. Meiyue Shao is partially supported by the National Natural Science Foundation of China under Grant No. 11971118.

Bibliography

(1) Balcan, D., Gonçalves, B., Hu, H., Ramasco, J.J., Colizza, V., Vespignani, A.: Modeling the spatial spread of infectious diseases: the GLobal Epidemic and Mobility computational model. J. Comput. Sci. 1(3), 132–145 (2010). https://doi.org/10.1016/j.jocs.2010.07.002
(2) Knyazev, A.: Recent implementations, applications, and extensions of the locally optimal block preconditioned conjugate gradient method (LOBPCG). arXiv:1708.08354 (2017)
(3) Saad, Y.: Numerical Methods for Large Eigenvalue Problems: Revised Edition. SIAM, Philadelphia, PA, USA (2011)
(4) Neymeyr, K.: A geometric theory for preconditioned inverse iteration I: Extrema of the Rayleigh quotient. Linear Algebra Appl. 322(1-3), 61–85 (2001). https://doi.org/10.1016/S0024-3795(00)00239-1
(5) Neymeyr, K.: A geometric theory for preconditioned inverse iteration applied to a subspace. Math. Comp. 71(237), 197–216 (2002). https://doi.org/10.1090/S0025-5718-01-01357-6
(6) Argentati, M., Knyazev, A., Neymeyr, K., Ovtchinnikov, E., Zhou, M.: Convergence theory for preconditioned eigenvalue solvers in a nutshell. Found. Comput. Math. 17, 713–727 (2017). https://doi.org/10.1007/s10208-015-9297-1
(7) Knyazev, A.V.: Toward the optimal preconditioned eigensolver: Locally optimal block preconditioned conjugate gradient method. SIAM J. Sci. Comput. 23(2), 517–541 (2001). https://doi.org/10.1137/S1064827500366124
