Efficient implementations of the modified Gram-Schmidt orthogonalization with a non-standard inner product
Akira Imakura, Yusaku Yamamoto

TL;DR
This paper introduces optimized implementations of the modified Gram-Schmidt orthogonalization for non-standard inner products, reducing computational cost and maintaining high accuracy, with practical benefits demonstrated through experiments.
Contribution
It proposes $n$-matrix-vector multiplication implementations of MGS for non-standard inner products, improving efficiency and accuracy over naive methods.
Findings
The HA-type implementation achieves high accuracy with error bounds.
The HP-type implementation offers high performance with reduced computation.
Numerical experiments show competitive advantages in cost and accuracy.
Abstract
The modified Gram-Schmidt (MGS) orthogonalization is one of the most well-used algorithms for computing the thin QR factorization. MGS can be straightforwardly extended to a non-standard inner product with respect to a symmetric positive definite matrix . For the thin QR factorization of an matrix with the non-standard inner product, a naive implementation of MGS requires matrix-vector multiplications (MV) with respect to . In this paper, we propose -MV implementations: a high accuracy (HA) type and a high performance (HP) type, of MGS. We also provide error bounds of the HA-type implementation. Numerical experiments and analysis indicate that the proposed implementations have competitive advantages over the naive implementation in terms of both computational cost and accuracy.
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsMatrix Theory and Algorithms · Tensor decomposition and applications · Advanced Wireless Communication Techniques
Efficient implementations of the modified Gram-Schmidt orthogonalization with a non-standard inner product
Akira Imakura
University of Tsukuba, Japan
Yusaku Yamamoto
The University of Electro-Communications, Japan
Abstract
The modified Gram-Schmidt (MGS) orthogonalization is one of the most well-used algorithms for computing the thin QR factorization. MGS can be straightforwardly extended to a non-standard inner product with respect to a symmetric positive definite matrix . For the thin QR factorization of an matrix with the non-standard inner product, a naive implementation of MGS requires matrix-vector multiplications (MV) with respect to . In this paper, we propose -MV implementations: a high accuracy (HA) type and a high performance (HP) type, of MGS. We also provide error bounds of the HA-type implementation. Numerical experiments and analysis indicate that the proposed implementations have competitive advantages over the naive implementation in terms of both computational cost and accuracy.
1 Introduction
In this paper, we consider computing the thin QR factorization with a non-standard inner product of the form
[TABLE]
where and is symmetric positive definite (spd). This type of QR factorization with a non-standard inner product (1) appears in weighted least squares problems [1, 5], projection methods for solving symmetric generalized eigenvalue problems [9, 8], the weighted (block) GMRES and FOM methods [4, 7] and so on.
For the standard inner product, i.e., , there are several established algorithms for computing the thin QR factorization [15, 1]. These methods can be classified into two groups: orthogonal triangularization methods such as the Householder transformation and triangular orthogonalization methods such as the Gram-Schmidt orthogonalization and the Cholesky QR algorithm. An extension of the Householder transformation for a quasimatrix has been developed by Trefethen [14] and it was shown to be applicable to (1) [17]. However, Trefethen’s Householder-type QR algorithm for (1) requires some -orthonormal basis that is a big issue to use it for general . In contrast, the methods in the second group can be straightforwardly extended to a non-standard inner product. The error bounds of these methods are also well analyzed in [11, 10, 16].
Here, we focus on the modified Gram-Schmidt (MGS) orthogonalization. For a standard inner product, the number of floating-point operations (flops) of MGS is . For the non-standard inner product, naive implementations of MGS additionally require matrix-vector multiplications (MV) with respect to [13], which is the most-time consuming part for general .
In this paper, we aim to reduce the computational cost of MGS. We propose high accuracy (HA) type and high performance (HP) type implementations of MGS that require only MV. We also provide error bounds of the HA-type implementation. One can also apply the proposed concept to the classical Gram-Schmidt (CGS) orthogonalization for its -MV implementations.
The remainder of this paper is organized as follows. In Section 2, we estimate the minimal computational cost for MGS and propose efficient implementations of MGS. We present error bounds of the proposed implementation in Section 3. Numerical results are reported in Section 4. Section 5 concludes the paper.
Throughout, the following notations are used. Let be spd and . Then, the -inner product of vectors and is defined as . Also, is the corresponding -norm. Norms without a subscript denote the 2-norm: and . Frobenius norm of a matrix is denoted by . For , we define the range space of the matrix by . If is of full column rank, then is the condition number of , where are the largest and smallest non-zero singular values of .
2 Efficient implementations of MGS
There are two types of implementations of MGS: a column-oriented (left-looking) version and a row-oriented (right-looking) version; see Algorithms 1 and 2 [1]. In this section, we firstly introduce naive implementations with MV. Then, we estimate the minimal computational cost for MGS and propose efficient implementations of MGS.
2.1 Naive implementations with MV
For the standard inner product, there is no numerical difference between the column- and row-oriented versions regarding computational cost, memory requirements and accuracy. Because the operations and rounding errors are the same, they produce exactly the same numerical results. On the other hand, each one has different advantages for using. The column-oriented MGS has advantages for successive orthogonalization and reorthogonalization; in contrast, the row-oriented MGS is suitable for column pivoting.
However, the situation is different for a non-standard inner product regarding computational cost and memory requirements. In naive implementations that uses no auxiliary vectors, the row-oriented MGS requires MV; in contrast, the column-oriented MGS requires MV to compute the -inner products:
[TABLE]
because depends on both and [13, 18]. On the other hand, if storing auxiliary vectors , is allowed, then the number of MV of the column-oriented MGS is reduced to by computing as
[TABLE]
because depends only on [13]. This achieves a -MV implementation of the column-oriented MGS.
Because the computational cost of MV is unreasonably large, we generally use the -MV implementations. Naive implementations with MV of the column- and row-oriented MGS are shown in Algorithms 3 and 4, respectively. Here, we note that they have the same computational cost and produce exactly the same numerical results.
2.2 Estimation of the minimal computational costs
In MGS, MV with respect to is used only for computing the -inner products and -norms to construct the elements of ,
[TABLE]
Then, we have the following proposition.
Proposition 1**.**
For each element of in (1), there exist such that
[TABLE]
Proof.
From the recurrence of MGS, holds for . Therefore, there exist such that
[TABLE]
which proves Proposition 1 because . ∎
Proposition 1 suggests the possibility of implementing MGS with only MV required for constructing the subspace . Therefore, we estimate the minimal computational costs for MGS to be
[TABLE]
whether the column- or row-oriented is used, because the number of flops for MGS with the standard inner product is .
2.3 -MV implementations of MGS: MGS-HA and MGS-HP
Here, we propose two types of -MV implementations of both the column- and row-oriented MGS: a high accuracy type (MGS-HA) and a high performance type (MGS-HP).
Firstly, we introduce a technique to achieve -MV implementations for the column-oriented MGS (Algorithm 3). In each iteration for , the column-oriented MGS requires two MV. One is for computing the -norm by
[TABLE]
and another is for computing the -inner product by
[TABLE]
Based on these formula and , we can compute without MV by
[TABLE]
which achieves an -MV implementation of the column-oriented MGS as shown in Algorithm 5.
Algorithm 5 has nearly the same error bounds as MGS-naive, as we will show in Section 3. In this sense, we call this a high accuracy type MGS, MGS-HA. The computational cost of MGS-HA is , which is the same as the estimated minimal computational cost (3). Therefore, regarding the computational cost, MGS-HA is an optimal implementation for MGS.
Although MGS-HA is optimal in terms of the computational cost, it performs one MV in each iteration of the loop. This sequential MV reduces the computational performance. On the other hand, Proposition 1 indicates that MV can be performed together because MV are required only for constructing the subspace . In other words, we firstly compute , then we can compute all the elements from the matrices and without MV.
To achieve this, we compute by
[TABLE]
where as shown in Algorithm 6. The computational cost of Algorithm 6 is , which is larger than that of MGS-HA. However, Algorithm 6 is expected to show higher computational performance and smaller computational time than MGS-HA (cost: ), because a matrix-matrix multiplication is much faster than the sequential MV. In this sense, we call this a high performance type MGS, MGS-HP.
We can derive -MV implementations of the row-oriented MGS in the same manner. The vector is computed without MV by
[TABLE]
as well as (4) and the vector is computed without a sequential MV by
[TABLE]
as well as (5) for MGS-HP(row). The algorithms of MGS-HA(row) and MGS-HP(row) are shown in Algorithms 7 and 8, respectively.
The proposed concept can also be applied to CGS for its -MV implementations: CGS-HA(col/row) and CGS-HP(col/row). It is also noted that the HP-type of row-oriented versions: MGS-HP(row) and CGS-HP(row), are equivalent to the algorithms introduced in [2] to use in the block conjugate gradient method for solving linear systems with multiple right-hand sides. However, the performance of these algorithms are not analyzed and evaluated in [2], because the main objective of [2] is to propose the block conjugate gradient method.
3 Analysis of error bounds
In this section, we present error bounds on the representation error and the loss of -orthogonality of MGS-HA (Algorithm 5) and show that MGS-HA has nearly the same error bounds as MGS-naive (Algorithm 3).
Let and let denote their counterparts computed in floating-point arithmetic. Also, we denote by and the matrix and the vector whose entries are absolute values of entries of and , respectively.
Assuming that , we use the following error bounds for scaling , inner product and MV computed in floating-point arithmetic:
[TABLE]
where is the unit rounding error and [6].
3.1 Upper bound of representation error
The recurrence formulas of and in Gram-Schmidt orthogonalization are written as
[TABLE]
These formulas are independent of the inner product used. They are also the same whether the naive implementation (MGS-naive, Algorithm 3) or the proposed implementation (MGS-HA, Algorithm 5) is used. The only difference between MGS-naive and MGS-HA lies in how to compute .
In [11, Theorem 3.1], an upper bound on the representation error of MGS-naive in floating-point arithmetic is derived as
[TABLE]
based only on (9) and (10). Because MGS-HA also uses (9) and (10), we have the same upper bound on the representation error of MGS-HA. It is to be noted that the upper bound (11) depends on the computed results , so it is an a posteriori error bound. Hence, (11) means that, if the norms of the computed results are nearly the same for both methods, they have nearly the same upper bounds.
Eqs. (9) and (10) are also the same for CGS-naive and CGS-HA. Therefore, we have the same upper bound of the representation error for CGS-naive and CGS-HA.
3.2 Upper bound of loss of -orthogonality
The main difference between MGS-naive and MGS-HA lies in how to compute for the strict upper triangular part (), because both of the methods compute the diagonal element by .
MGS-naive computes from and by
[TABLE]
In contrast, MGS-HA computes from the unnormalized vector by
[TABLE]
On the other hand, the vector is computed by normalization of as in MGS-naive, i.e.,
[TABLE]
According to [11, Theorem 3.2], the local errors of -inner product, AXPY (9) and scaling (10) are propagated by to be the loss of -orthogonality of MGS-naive, . Eqs. (9) and (10) are same for both methods. We can use the same evaluation for the norm of . Therefore, we just analyze the local error of the -inner product.
From the error bounds of MV and inner product, (8) and (7), Eqs. (12) and (13) in floating-point arithmetic can be written as
[TABLE]
From (17) and (18), an error bound of computed by MGS-naive, ignoring terms of , is derived [11] as
[TABLE]
where we used , and [6].
In contrast, formulas (14)–(16) of MGS-HA in floating-point arithmetic become
[TABLE]
These formulas compute from and . Because the local error of -inner product is defined as the difference between and , we also need a relationship between and , i.e.,
[TABLE]
Substituting (22), (21), (20) and (23) into in this order and ignoring terms of , we have
[TABLE]
Comparing (19) for MGS-naive and (24) for MGS-HA, we know that the only difference is in the coefficients:
[TABLE]
In [11], it is shown that the strict upper triangular part of the loss of -orthogonality , which is denoted as , can be bounded as
[TABLE]
where is a strict upper triangular matrix defined by
[TABLE]
and () and are floating-point errors arising in the AXPY operation (9) and scaling (10), respectively111The definition of given in [11] is actually the definition of . We corrected this in Eq. (26).. See the proof of Theorem 3.2 in [11] for details.
In Eqs. (25)–(27), the AXPY error () and the scaling errors and can be bounded by the same expression in both methods, because their computational formulas are the same. The norm can also be bounded in the same way in both methods. Hence, the only difference lies in the evaluation of the local error of , defined as . But comparing (19) with (24) reveals that the difference in this part is slight. In addition, the diagonal part of is nothing but the scaling error and has the same bound for both methods. Thus we can conclude that MGS-HA has the same a posteriori bound for the loss of -orthogonality as MGS-naive [11]:
[TABLE]
provided that .
3.3 Analysis of CGS
For a variant of CGS, CGS-P [12], that computes the diagonal element in a different way from the original CGS, error bounds for a non-standard inner product are given in [11]. On the other hand, error bounds of original CGS have not been well analyzed yet for a non-standard inner product.
However, we can estimate the influence of the proposed approach on the error bounds of CGS. As in the case of MGS, the only difference between CGS-naive and CGS-HA is how to compute . For both CGS-naive and CGS-HA, the recurrence formulas are obtained from those of MGS-naive and MGS-HA, respectively, by changing to . Therefore, the local error in the computation of can be evaluated by (19) and (24) by changing to . Thus, the local errors of are nearly the same for both CGS-naive and CGS-HA. As a result, we can expect that CGS-HA has nearly the same loss of -orthogonality as CGS-naive.
4 Numerical experiments
In this section, we evaluate the computational performance of MGS-HA (Algorithm 5) and MGS-HP (Algorithm 6). In particular, we compare the computation time and the loss of -orthogonality of these methods with those of MGS-naive (Algorithm 3), CGS-naive and Cholesky QR, the last of which is one of the fastest algorithms for (1).
4.1 Numerical experiment I
Firstly, we compare the computation time of MGS-naive, MGS-HA, MGS-HP and Cholesky QR for two different problems. For the first problem, is a random dense spd matrix with . For the second problem, is a sparse spd matrix AUNW9180 obtained from ELSES matrix library [3]. This is an overlap matrix in an electronic structure calculation of a helical multishell gold nanowire. The size of the matrix is and the number of non-zero entries is . For both problems, we set to be a random dense matrix. We test .
All the numerical experiments were carried out in double precision arithmetic on OS: CentOS 64bit, CPU: Intel Xeon CPU E5-2667 3.20GHz (1 core), Memory: 48GB. We used Intel MKL for matrix computations and Mersenne twister for generating random matrices.
Figure 1 shows the computation time for both problems, while Figure 2 shows breakdown of the computation time scaled by the total computation time of MGS-naive for each . When , most of the computation time is used for computing MV and hence the total time increases proportionally to ; see the left columns of Figure 1 and Figure 2. In this situation, MGS-HA achieves 2x speedup over MGS-naive. MGS-HP and Cholesky QR are even faster and show drastic speedup over these methods. On the other hand, as becomes larger, the ratio of computation time for other parts increases, especially for the sparse problem. In this situation, the speedup ratio of the proposed methods becomes relatively small, although both methods are still faster than MGS-naive.
4.2 Numerical experiment II
Next, we compare the loss of -orthogonality
[TABLE]
of MGS-naive, MGS-HA, MGS-HP, CGS-naive and Cholesky QR. Let be a random orthogonal matrix. Then, we set as
[TABLE]
where
[TABLE]
so that are evenly spaced. Also, let be a random orthogonal matrix and be matrices whose columns are eigenvectors of corresponding to the largest and the smallest eigenvalues, respectively. Then, we set as
[TABLE]
where
[TABLE]
Case 1 and case 2 provide a best case and a worst case with respect to the loss of -orthogonality, respectively [10]. We set and test problems with for each case.
All the numerical experiments were carried out in MATLAB2016a. We used Mersenne twister for generating random matrices.
We present log10 of the loss of -orthogonality as a function of and for case 1 and case 2 in Figures 3 and 4, respectively. In case 1 (the best case), the loss of -orthogonality of all methods depends only on ; in contrast, it depends on both and in case 2 (the worst case). In both cases, MGS-naive, MGS-HA and MGS-HP show better accuracy than CGS-naive and Cholesky QR: the dependence on is linear for the former and quadratic for the latter. Here, we note that Cholesky QR failed when .
Next, we compare the proposed implementations, MGS-HA and MGS-HP, with MGS-naive. MGS-HP shows nearly the same accuracy as MGS-naive in both cases and MGS-HA shows nearly the same accuracy as MGS-naive in case 1. In addition, as a remarkable result, we observe that MGS-HA shows better accuracy than MGS-naive in case 2, especially when both and are ill-conditioned: ; see Figure 4(b).
4.3 Numerical experiment III
Here, we compare the loss of -orthogonality of the computed results of MGS-naive, MGS-HA and MGS-HP in case 2 with the theoretical error bound (28) derived in Section 3. As shown in Section 3, the loss of -orthogonality of MGS-naive and MGS-HA are bounded by
[TABLE]
Figure 6 shows the upper bound as a function of and , while Figure 6 plots the actual loss of -orthogonality against . Comparing Figures 4(a), (c) and 6 reveals that represents the actual loss of -orthogonality well for MGS-naive and MGS-HP. In fact, we can see from Figure 6(a), (c) that is not only an upper bound, but also a good estimate of the actual loss of -orthogonality for MGS-naive and MGS-HP. For MGS-HA, however, there are many computational results for which the loss of -orthogonality is much lower than suggested by ; see Figure 6(b). This indicates that although is certainly an upper bound for MGS-HA, it may not be a sharp upper bound.
Instead of , let us consider the following quantity:
[TABLE]
Figure 8 shows as a function of and , while Figure 8 plots the actual loss of -orthogonality against . Comparing Figures 4(b) and 8, we can see that represents the computational results of MGS-HA well. We also observe, from Figure 8(b), that is a very sharp upper bound for MGS-HA. It is to be noted that does not serve as an upper bound of the loss of -orthogonality for MGS-naive and MGS-HP; see Figure 8(a), (c).
Although the quantity we introduced here has no theoretical background yet, the numerical results suggest that it can describe the actual loss of -orthogonality for MGS-HA very well. Based on this fact, we make the following conjecture on a sharper upper bound for the loss of -orthogonality of MGS-HA.
Conjecture 1**.**
The loss of -orthogonality of MGS-HA can be bounded as
[TABLE]
5 Conclusions
In this paper, we propose two types of efficient implementations of the modified Gram-Schmidt orthogonalization with a non-standard inner product. These methods, named MGS-HA and MGS-HP, require only MV, in contrast to the naive implementation, MGS-naive, that requires MV. Experimental results show that both methods are much faster than MGS-naive. Specifically, MGS-HP is nearly as fast as Cholesky QR for small . Regarding accuracy, we prove that MGS-HA has nearly the same error bounds for representation error and loss of -orthogonality as MGS-naive. According to the numerical experiments, MGS-HP shows nearly the same accuracy and MGS-HA shows higher accuracy than MGS-naive. We also introduce a conjecture on a sharper upper bound for the loss of -orthogonality for MGS-HA (Conjecture 1).
In the future, we expect to prove Conjecture 1 and also derive an upper bound for MGS-HP. We also plan to evaluate the computational performance of the proposed implementations for large problems in parallel environments.
Acknowledgment
The present study is supported in part by Japan Science and Technology Agency, ACT-I (No. JPMJPR16U6) and the Japanese Ministry of Education, Culture, Sports, Science and Technology, Grant-in-Aid for Scientific Research (Nos. 26286087, 15H02708, 15H02709, 16KT0016).
The reference list from the paper itself. Each links out to its DOI / PubMed record.
- 1[1] Å. Björck, Numerical Methods for Least Squares Problems, SIAM, 1996.
- 2[2] A. A. Dubrulle, Retooling the method of block conjugate gradients, ETNA, 12 (2001), 216–233.
- 3[3] ELSES matrix library, http://www.elses.jp/matrix/ .
- 4[4] A. Essai, Weighted FOM and GMRES for solving nonsymmetric linear systems, Numer. Alg., 18 (1998), 277–292.
- 5[5] M. Gulliksson, On the modified Gram-Schmidt algorithm for weighted and constrained linear least squares problems, BIT Numerical Mathematics, 35 (1995) 453–468.
- 6[6] N. J. Higham, Accuracy and Stability of Numerical Algorithms, 2nd ed., SIAM, 2002.
- 7[7] A. Imakura, L. Du, H. Tadano, A Weighted Block GMRES method for solving linear systems with multiple right-hand sides, JSIAM Letters, 5 (2013), 65–68.
- 8[8] A. Imakura, T. Sakurai, Block Krylov-type complex moment-based eigensolvers for solving generalized eigenvalue problems, Numer. Alg., (accepted).
