Efficient implementations of the modified Gram-Schmidt orthogonalization with a non-standard inner product

Akira Imakura; Yusaku Yamamoto

arXiv:1703.10440·math.NA·March 31, 2017


TL;DR

This paper introduces optimized implementations of the modified Gram-Schmidt orthogonalization for non-standard inner products, reducing computational cost and maintaining high accuracy, with practical benefits demonstrated through experiments.

Contribution

It proposes $n$-matrix-vector multiplication implementations of MGS for non-standard inner products, improving efficiency and accuracy over naive methods.

Findings

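1. The HA-type implementation needs only $n$ MV and has nearly the same error bounds as the naive implementation.

2. The HP-type implementation also needs only $n$ MV and performs them together as one matrix-matrix product, which gives higher performance at the price of some extra flops.

3. Numerical experiments show advantages over the naive implementation in both computational cost and accuracy.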

Abstract

The modified Gram-Schmidt (MGS) orthogonalization is one of the most widely used algorithms for computing the thin QR factorization. MGS can be straightforwardly extended to a non-standard inner product with respect to a symmetric positive definite matrix $A$. For the thin QR factorization of an $m \times n$ matrix with the non-standard inner product, a naive implementation of MGS requires $2n$ matrix-vector multiplications (MV) with respect to $A$. In this paper, we propose two $n$-MV implementations of MGS: a high accuracy (HA) type and a high performance (HP) type. We also provide error bounds for the HA-type implementation. Numerical experiments and analysis indicate that the proposed implementations have competitive advantages over the naive implementation in terms of both computational cost and accuracy.

Equations

$$Z=QR,\quad Q^{\rm T}AQ=I_{n},$$

$$r_{ij}=({\bm{q}}_{i},{\bm{z}}_{j}^{(i-1)})_{A}={\bm{q}}_{i}^{\rm T}(A{\bm{z}}_{j}^{(i-1)})\quad(i<j),$$

$$r_{ij}=({\bm{q}}_{i},{\bm{z}}_{j}^{(i-1)})_{A}=(A{\bm{q}}_{i})^{\rm T}{\bm{z}}_{j}^{(i-1)}\quad(i<j),$$

$$r_{ij}=({\bm{q}}_{i},{\bm{z}}_{j}^{(i-1)})_{A}\quad(i<j),\qquad r_{jj}=\|{\bm{z}}_{j}^{(j-1)}\|_{A}=\sqrt{({\bm{z}}_{j}^{(j-1)},{\bm{z}}_{j}^{(j-1)})_{A}}.$$

$$r_{ij}=({\bm{a}},{\bm{b}})_{2}.$$

$$r_{ij}=({\bm{x}},{\bm{y}})_{A}=({\bm{x}},A{\bm{y}})_{2},$$

$$\mbox{minimal costs for MGS}:\quad n\mbox{ MV}+2mn^{2}\mbox{ flops},$$

$${\bm{x}}_{j}^{(j-1)}=A{\bm{z}}_{j}^{(j-1)},\qquad r_{jj}=\sqrt{({\bm{z}}_{j}^{(j-1)})^{\rm T}{\bm{x}}_{j}^{(j-1)}},$$

$${\bm{p}}_{j}=A{\bm{q}}_{j},\qquad r_{ij}={\bm{p}}_{i}^{\rm T}{\bm{z}}_{j}^{(i-1)}\quad(i<j).$$

$${\bm{p}}_{j}=A{\bm{q}}_{j}=A\frac{{\bm{z}}_{j}^{(j-1)}}{r_{jj}}=\frac{A{\bm{z}}_{j}^{(j-1)}}{r_{jj}}=\frac{{\bm{x}}_{j}^{(j-1)}}{r_{jj}},$$

$${\bm{x}}_{j}^{(i)}=A{\bm{z}}_{j}^{(i)}=A\Bigl({\bm{z}}_{j}^{(0)}-\sum_{k=1}^{i}r_{kj}{\bm{q}}_{k}\Bigr)=A{\bm{z}}_{j}^{(0)}-\sum_{k=1}^{i}r_{kj}A{\bm{q}}_{k}={\bm{x}}_{j}^{(0)}-\sum_{k=1}^{i}r_{kj}{\bm{p}}_{k},$$

$${\bm{p}}_{i}\;(=A{\bm{q}}_{i})=\frac{{\bm{x}}_{i}^{(i-1)}}{r_{ii}},$$

$${\bm{x}}_{i}^{(i-1)}\;(=A{\bm{z}}_{i}^{(i-1)})={\bm{x}}_{i}^{(0)}-\sum_{k=1}^{i-1}r_{ki}{\bm{p}}_{k}$$

$$\widehat{\bm{y}}=\alpha{\bm{x}}+\Delta{\bm{y}},\quad|\Delta{\bm{y}}|\leq{\bf u}|\alpha||{\bm{x}}|,$$

$$\widehat{\alpha}={\bm{x}}^{\rm T}{\bm{y}}+\Delta\alpha,\quad|\Delta\alpha|\leq\gamma_{m}|{\bm{x}}|^{\rm T}|{\bm{y}}|,$$

$$\widehat{\bm{y}}=A{\bm{x}}+\Delta{\bm{y}},\quad|\Delta{\bm{y}}|\leq\gamma_{m}|A||{\bm{x}}|,$$

$${\bm{z}}_{j}^{(i)}={\bm{z}}_{j}^{(i-1)}-r_{ij}{\bm{q}}_{i}\quad(i=1,2,\dots,j-1),$$

$${\bm{q}}_{j}=\frac{{\bm{z}}_{j}^{(j-1)}}{r_{jj}}.$$

$$\|Z-\widehat{Q}\widehat{R}\|\leq\mathcal{O}(n^{3/2}){\bf u}\,(\|Z\|+\|\widehat{Q}\|\|\widehat{R}\|)$$

$${\bm{p}}_{i}=A{\bm{q}}_{i},\qquad r_{ij}={\bm{p}}_{i}^{\rm T}{\bm{z}}_{j}^{(i-1)}.$$

$${\bm{x}}_{i}^{(i-1)}=A{\bm{z}}_{i}^{(i-1)},\qquad{\bm{p}}_{i}=\frac{{\bm{x}}_{i}^{(i-1)}}{r_{ii}},\qquad r_{ij}={\bm{p}}_{i}^{\rm T}{\bm{z}}_{j}^{(i-1)}.$$

$${\bm{q}}_{i}=\frac{{\bm{z}}_{i}^{(i-1)}}{r_{ii}}.$$

$$\widehat{\bm{p}}_{i}=A\widehat{\bm{q}}_{i}+\Delta{\bm{p}}_{i},\quad|\Delta{\bm{p}}_{i}|\leq\gamma_{m}|A||\widehat{\bm{q}}_{i}|,$$

$$\widehat{r}_{ij}=\widehat{\bm{p}}_{i}^{\rm T}\widehat{\bm{z}}_{j}^{(i-1)}+\Delta r_{ij},\quad|\Delta r_{ij}|\leq\gamma_{m}|\widehat{\bm{p}}_{i}|^{\rm T}|\widehat{\bm{z}}_{j}^{(i-1)}|.$$

$$\begin{aligned}|\widehat{r}_{ij}-\widehat{\bm{q}}_{i}^{\rm T}A\widehat{\bm{z}}_{j}^{(i-1)}|&\leq|(\Delta{\bm{p}}_{i})^{\rm T}\widehat{\bm{z}}_{j}^{(i-1)}|+|\Delta r_{ij}|\\&\leq\|\Delta{\bm{p}}_{i}\|\|\widehat{\bm{z}}_{j}^{(i-1)}\|+\gamma_{m}\|\widehat{\bm{p}}_{i}\|\|\widehat{\bm{z}}_{j}^{(i-1)}\|\\&\leq\gamma_{m}\|\,|A|\,\|\|\widehat{\bm{q}}_{i}\|\|\widehat{\bm{z}}_{j}^{(i-1)}\|+\gamma_{m}\|A\|\|\widehat{\bm{q}}_{i}\|\|\widehat{\bm{z}}_{j}^{(i-1)}\|\\&\leq\gamma_{m\sqrt{m}+m}\|A\|\|\widehat{\bm{q}}_{i}\|\|\widehat{\bm{z}}_{j}^{(i-1)}\|,\end{aligned}$$

$$\widehat{\bm{x}}_{i}^{(i-1)}=A\widehat{\bm{z}}_{i}^{(i-1)}+\Delta{\bm{x}}_{i}^{(i-1)},\quad|\Delta{\bm{x}}_{i}^{(i-1)}|\leq\gamma_{m}|A||\widehat{\bm{z}}_{i}^{(i-1)}|,$$

$$\widehat{\bm{p}}_{i}=\frac{\widehat{\bm{x}}_{i}^{(i-1)}}{\widehat{r}_{ii}}+\Delta{\bm{p}}_{i},\quad|\Delta{\bm{p}}_{i}|\leq{\bf u}\frac{|\widehat{\bm{x}}_{i}^{(i-1)}|}{|\widehat{r}_{ii}|},$$

$$\widehat{r}_{ij}=\widehat{\bm{p}}_{i}^{\rm T}\widehat{\bm{z}}_{j}^{(i-1)}+\Delta r_{ij},\quad|\Delta r_{ij}|\leq\gamma_{m}|\widehat{\bm{p}}_{i}|^{\rm T}|\widehat{\bm{z}}_{j}^{(i-1)}|.$$


Full text

Akira Imakura

University of Tsukuba, Japan

Yusaku Yamamoto

The University of Electro-Communications, Japan

1 Introduction

In this paper, we consider computing the thin QR factorization with a non-standard inner product of the form

$$Z=QR,\quad Q^{\rm T}AQ=I_{n}, \tag{1}$$

where $Z,Q\in\mathbb{R}^{m\times n}$ $(m\geq n),R\in\mathbb{R}^{n\times n}$ and $A\in\mathbb{R}^{m\times m}$ is symmetric positive definite (spd). This type of QR factorization with a non-standard inner product (1) appears in weighted least squares problems [1, 5], projection methods for solving symmetric generalized eigenvalue problems [9, 8], the weighted (block) GMRES and FOM methods [4, 7] and so on.

For the standard inner product, i.e., $A=I_{m}$, there are several established algorithms for computing the thin QR factorization [15, 1]. These methods can be classified into two groups: orthogonal triangularization methods such as the Householder transformation, and triangular orthogonalization methods such as the Gram-Schmidt orthogonalization and the Cholesky QR algorithm. An extension of the Householder transformation to quasimatrices has been developed by Trefethen [14], and it was shown to be applicable to (1) [17]. However, Trefethen's Householder-type QR algorithm for (1) requires some $A$-orthonormal basis, which is a serious obstacle to its use for general $A$. In contrast, the methods in the second group can be straightforwardly extended to a non-standard inner product. The error bounds of these methods are also well analyzed in [11, 10, 16].

Here, we focus on the modified Gram-Schmidt (MGS) orthogonalization. For the standard inner product, the number of floating-point operations (flops) of MGS is $2mn^{2}$. For the non-standard inner product, naive implementations of MGS additionally require $2n$ matrix-vector multiplications (MV) with respect to $A$ [13], which are the most time-consuming part for general $A$.

In this paper, we aim to reduce the computational cost of MGS. We propose high accuracy (HA) type and high performance (HP) type implementations of MGS that require only $n$ MV. We also provide error bounds of the HA-type implementation. One can also apply the proposed concept to the classical Gram-Schmidt (CGS) orthogonalization for its $n$ -MV implementations.

The remainder of this paper is organized as follows. In Section 2, we estimate the minimal computational cost for MGS and propose efficient implementations of MGS. We present error bounds of the proposed implementation in Section 3. Numerical results are reported in Section 4. Section 5 concludes the paper.

Throughout, the following notation is used. Let $A\in\mathbb{R}^{m\times m}$ be spd and ${\bm{x}},{\bm{y}}\in\mathbb{R}^{m}$. Then, the $A$-inner product of vectors ${\bm{x}}$ and ${\bm{y}}$ is defined as $({\bm{x}},{\bm{y}})_{A}:={\bm{x}}^{\rm T}A{\bm{y}}$. Also, $\|{\bm{x}}\|_{A}:=\sqrt{({\bm{x}},{\bm{x}})_{A}}=\sqrt{{\bm{x}}^{\rm T}A{\bm{x}}}$ is the corresponding $A$-norm. Norms without a subscript denote the 2-norm: $\|{\bm{x}}\|:=\|{\bm{x}}\|_{2}$ and $\|A\|:=\|A\|_{2}$. The Frobenius norm of a matrix $A$ is denoted by $\|A\|_{\rm F}$. For $Z=[{\bm{z}}_{1},{\bm{z}}_{2},\ldots,{\bm{z}}_{n}]\in\mathbb{R}^{m\times n}$, we define the range space of the matrix $Z$ by $\mathcal{R}(Z):={\rm span}\{{\bm{z}}_{1},{\bm{z}}_{2},\dots,{\bm{z}}_{n}\}$. If $Z$ is of full column rank, then $\kappa(Z):=\sigma_{1}/\sigma_{n}$ is the condition number of $Z$, where $\sigma_{1}$ and $\sigma_{n}$ are the largest and smallest non-zero singular values of $Z$.
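As a small illustration of this notation, the $A$-inner product and $A$-norm can be sketched in NumPy as follows. The function names `a_inner` and `a_norm` and the $2\times 2$ matrix are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def a_inner(x, y, A):
    """A-inner product (x, y)_A = x^T A y; costs one MV with A."""
    return x @ (A @ y)

def a_norm(x, A):
    """A-norm ||x||_A = sqrt(x^T A x); A is assumed to be spd."""
    return np.sqrt(a_inner(x, x, A))

# Small spd example (illustrative values only).
A = np.array([[2.0, 1.0],
              [1.0, 3.0]])
x = np.array([1.0, 0.0])
y = np.array([0.0, 1.0])

ip_xy = a_inner(x, y, A)   # equals the (1, 2) entry of A
norm_x = a_norm(x, A)      # equals sqrt of the (1, 1) entry of A
```

Each call to `a_inner` hides one MV with $A$, which is the cost the rest of the paper tries to minimize.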

2 Efficient implementations of MGS

There are two types of implementations of MGS: a column-oriented (left-looking) version and a row-oriented (right-looking) version; see Algorithms 1 and 2 [1]. In this section, we first introduce naive implementations with $2n$ MV. Then, we estimate the minimal computational cost for MGS and propose efficient implementations of MGS.

2.1 Naive implementations with $2n$ MV

For the standard inner product, the column- and row-oriented versions do not differ in computational cost, memory requirements or accuracy. Because the operations and rounding errors are the same, they produce exactly the same numerical results. On the other hand, each version has its own practical advantages. The column-oriented MGS is suited to successive orthogonalization and reorthogonalization; in contrast, the row-oriented MGS is suitable for column pivoting.

However, the situation is different for a non-standard inner product regarding computational cost and memory requirements. In naive implementations that use no auxiliary vectors, the row-oriented MGS requires $2n$ MV; in contrast, the column-oriented MGS requires $\mathcal{O}(n^{2})$ MV to compute the $A$-inner products

$$r_{ij}=({\bm{q}}_{i},{\bm{z}}_{j}^{(i-1)})_{A}={\bm{q}}_{i}^{\rm T}(A{\bm{z}}_{j}^{(i-1)})\quad(i<j),$$

because ${\bm{z}}_{j}^{(i-1)}$ depends on both $i$ and $j$ [13, 18]. On the other hand, if storing $n$ auxiliary vectors $A{\bm{q}}_{j}$, $j=1,2,\dots,n$, is allowed, then the number of MV of the column-oriented MGS is reduced to $2n$ by computing $r_{ij}$ as

$$r_{ij}=({\bm{q}}_{i},{\bm{z}}_{j}^{(i-1)})_{A}=(A{\bm{q}}_{i})^{\rm T}{\bm{z}}_{j}^{(i-1)}\quad(i<j),$$

because ${\bm{q}}_{i}$ depends only on $i$ [13]. This achieves a $2n$-MV implementation of the column-oriented MGS.

Because the computational cost of $\mathcal{O}(n^{2})$ MV is unreasonably large, we generally use the $2n$ -MV implementations. Naive implementations with $2n$ MV of the column- and row-oriented MGS are shown in Algorithms 3 and 4, respectively. Here, we note that they have the same computational cost and produce exactly the same numerical results.
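The column-oriented $2n$-MV variant described above can be sketched as follows. This is a minimal NumPy sketch written from the description in the text, not a transcription of Algorithm 3; the function name `mgs_naive_col` and the dense storage of the auxiliary vectors $A{\bm{q}}_{i}$ in `P` are assumptions.

```python
import numpy as np

def mgs_naive_col(Z, A):
    """Column-oriented MGS in the A-inner product, two MV with A per column."""
    m, n = Z.shape
    Q = np.zeros((m, n))
    P = np.zeros((m, n))      # auxiliary vectors p_i = A q_i
    R = np.zeros((n, n))
    for j in range(n):
        z = Z[:, j].astype(float)           # working copy of column j
        for i in range(j):
            R[i, j] = P[:, i] @ z           # r_ij = (A q_i)^T z_j^{(i-1)}
            z -= R[i, j] * Q[:, i]          # z_j^{(i)} = z_j^{(i-1)} - r_ij q_i
        R[j, j] = np.sqrt(z @ (A @ z))      # first MV: A-norm of z_j^{(j-1)}
        Q[:, j] = z / R[j, j]
        P[:, j] = A @ Q[:, j]               # second MV: p_j = A q_j
    return Q, R
```

The two products with `A` inside the `j` loop are what make this a $2n$-MV implementation.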

2.2 Estimation of the minimal computational costs

In MGS, MV with respect to $A$ is used only for computing the $A$ -inner products and $A$ -norms to construct the elements of $R$ ,

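$$r_{ij}=({\bm{q}}_{i},{\bm{z}}_{j}^{(i-1)})_{A}\quad(i<j),\qquad r_{jj}=\|{\bm{z}}_{j}^{(j-1)}\|_{A}=\sqrt{({\bm{z}}_{j}^{(j-1)},{\bm{z}}_{j}^{(j-1)})_{A}}.$$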

Then, we have the following proposition.

Proposition 1.

For each element $r_{ij}$ of $R$ in (1), there exist ${\bm{a}}\in\mathcal{R}(Z),{\bm{b}}\in\mathcal{R}(AZ)$ such that

$$r_{ij}=({\bm{a}},{\bm{b}})_{2}.$$

Proof.

From the recurrence of MGS, ${\bm{q}}_{i},{\bm{z}}_{j}^{(i-1)}\in\mathcal{R}(Z)$ holds for $1\leq i\leq j\leq n$ . Therefore, there exist ${\bm{x}},{\bm{y}}\in\mathcal{R}(Z)$ such that

$$r_{ij}=({\bm{x}},{\bm{y}})_{A}=({\bm{x}},A{\bm{y}})_{2},$$

which proves Proposition 1 because $A{\bm{y}}\in\mathcal{R}(AZ)$ . ∎

Proposition 1 suggests the possibility of implementing MGS with only $n$ MV required for constructing the subspace $\mathcal{R}(AZ)$ . Therefore, we estimate the minimal computational costs for MGS to be

$$\mbox{minimal costs for MGS}:\quad n\mbox{ MV}+2mn^{2}\mbox{ flops}, \tag{3}$$

regardless of whether the column- or row-oriented version is used, because the number of flops for MGS with the standard inner product is $2mn^{2}$.

2.3 $n$ -MV implementations of MGS: MGS-HA and MGS-HP

Here, we propose two types of $n$ -MV implementations of both the column- and row-oriented MGS: a high accuracy type (MGS-HA) and a high performance type (MGS-HP).

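First, we introduce a technique to achieve $n$-MV implementations for the column-oriented MGS (Algorithm 3). In each iteration for $j$, the column-oriented MGS requires two MV. One is for computing the $A$-norm $r_{jj}=\|{\bm{z}}_{j}^{(j-1)}\|_{A}$ by

$${\bm{x}}_{j}^{(j-1)}=A{\bm{z}}_{j}^{(j-1)},\qquad r_{jj}=\sqrt{({\bm{z}}_{j}^{(j-1)})^{\rm T}{\bm{x}}_{j}^{(j-1)}},$$

and the other is for computing the $A$-inner products $r_{ij}=({\bm{q}}_{i},{\bm{z}}_{j}^{(i-1)})_{A}$ $(i<j)$ by

$${\bm{p}}_{j}=A{\bm{q}}_{j},\qquad r_{ij}={\bm{p}}_{i}^{\rm T}{\bm{z}}_{j}^{(i-1)}\quad(i<j).$$

Based on these formulas and ${\bm{q}}_{j}={\bm{z}}_{j}^{(j-1)}/r_{jj}$, we can compute ${\bm{p}}_{j}=A{\bm{q}}_{j}$ without MV by

$${\bm{p}}_{j}=A{\bm{q}}_{j}=A\frac{{\bm{z}}_{j}^{(j-1)}}{r_{jj}}=\frac{A{\bm{z}}_{j}^{(j-1)}}{r_{jj}}=\frac{{\bm{x}}_{j}^{(j-1)}}{r_{jj}}, \tag{4}$$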

which achieves an $n$ -MV implementation of the column-oriented MGS as shown in Algorithm 5.

Algorithm 5 has nearly the same error bounds as MGS-naive, as we will show in Section 3. In this sense, we call this a high accuracy type MGS, MGS-HA. The computational cost of MGS-HA is $n\mbox{ MV}+2mn^{2}\mbox{ flops}$ , which is the same as the estimated minimal computational cost (3). Therefore, regarding the computational cost, MGS-HA is an optimal implementation for MGS.
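The idea behind MGS-HA, reusing ${\bm{x}}_{j}^{(j-1)}=A{\bm{z}}_{j}^{(j-1)}$ to obtain ${\bm{p}}_{j}$ by a scaling instead of a second MV, can be sketched as follows. This is a NumPy sketch based on the formulas in the text rather than a transcription of Algorithm 5; the name `mgs_ha_col` is hypothetical.

```python
import numpy as np

def mgs_ha_col(Z, A):
    """Column-oriented MGS in the A-inner product with one MV per column."""
    m, n = Z.shape
    Q = np.zeros((m, n))
    P = np.zeros((m, n))      # p_i = A q_i, obtained without an extra MV
    R = np.zeros((n, n))
    for j in range(n):
        z = Z[:, j].astype(float)
        for i in range(j):
            R[i, j] = P[:, i] @ z           # r_ij = p_i^T z_j^{(i-1)}
            z -= R[i, j] * Q[:, i]
        x = A @ z                           # the only MV: x = A z_j^{(j-1)}
        R[j, j] = np.sqrt(z @ x)            # A-norm from z and x
        Q[:, j] = z / R[j, j]
        P[:, j] = x / R[j, j]               # p_j = x_j^{(j-1)} / r_jj
    return Q, R
```

Per pair $(i,j)$ the inner loop does one inner product and one AXPY, so the flop count stays at about $2mn^{2}$ while the number of MV drops to $n$.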

Although MGS-HA is optimal in terms of the computational cost, it performs one MV in each iteration of the $j$ loop. These sequential MV reduce the computational performance. On the other hand, Proposition 1 indicates that the $n$ MV can be performed together, because MV are required only for constructing the subspace $\mathcal{R}(AZ)$. In other words, we can first compute $AZ$ and then compute all the elements $r_{ij}$ from the matrices $Z$ and $AZ$ without further MV.

To achieve this, we compute ${\bm{x}}_{j}^{(i)}=A{\bm{z}}_{j}^{(i)}$ by

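$${\bm{x}}_{j}^{(i)}=A{\bm{z}}_{j}^{(i)}=A\Bigl({\bm{z}}_{j}^{(0)}-\sum_{k=1}^{i}r_{kj}{\bm{q}}_{k}\Bigr)=A{\bm{z}}_{j}^{(0)}-\sum_{k=1}^{i}r_{kj}A{\bm{q}}_{k}={\bm{x}}_{j}^{(0)}-\sum_{k=1}^{i}r_{kj}{\bm{p}}_{k}, \tag{5}$$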

where $X=[{\bm{x}}_{1}^{(0)},{\bm{x}}_{2}^{(0)},\dots,{\bm{x}}_{n}^{(0)}]=AZ$ as shown in Algorithm 6. The computational cost of Algorithm 6 is $n\mbox{ MV}+3mn^{2}\mbox{ flops}$ , which is larger than that of MGS-HA. However, Algorithm 6 is expected to show higher computational performance and smaller computational time than MGS-HA (cost: $n\mbox{ MV}+2mn^{2}\mbox{ flops}$ ), because a matrix-matrix multiplication is much faster than the sequential MV. In this sense, we call this a high performance type MGS, MGS-HP.
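A column-oriented sketch of the MGS-HP idea is given below: all MV are done up front as one product $AZ$, and the columns of $AZ$ are then updated alongside the columns of $Z$. This is written from the description above and is not a transcription of Algorithm 6; the name `mgs_hp_col` is hypothetical.

```python
import numpy as np

def mgs_hp_col(Z, A):
    """Column-oriented MGS in the A-inner product with a single product A @ Z."""
    m, n = Z.shape
    X = A @ Z                 # all n MV at once, as a matrix-matrix product
    Q = np.zeros((m, n))
    P = np.zeros((m, n))      # p_i = A q_i
    R = np.zeros((n, n))
    for j in range(n):
        z = Z[:, j].astype(float)
        x = X[:, j].copy()                  # x_j^{(0)} = A z_j^{(0)}
        for i in range(j):
            R[i, j] = P[:, i] @ z
            z -= R[i, j] * Q[:, i]
            x -= R[i, j] * P[:, i]          # extra AXPY keeps x equal to A z
        R[j, j] = np.sqrt(z @ x)
        Q[:, j] = z / R[j, j]
        P[:, j] = x / R[j, j]
    return Q, R
```

The additional AXPY on `x` in the inner loop is the source of the extra $mn^{2}$ flops, which is the price paid for replacing $n$ sequential MV by one matrix-matrix product.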

We can derive $n$-MV implementations of the row-oriented MGS in the same manner. The vector ${\bm{p}}_{i}=A{\bm{q}}_{i}$ is computed without MV by

$${\bm{p}}_{i}\;(=A{\bm{q}}_{i})=\frac{{\bm{x}}_{i}^{(i-1)}}{r_{ii}},$$

in the same way as (4), and the vector ${\bm{x}}_{i}^{(i-1)}=A{\bm{z}}_{i}^{(i-1)}$ is computed without a sequential MV by

$${\bm{x}}_{i}^{(i-1)}\;(=A{\bm{z}}_{i}^{(i-1)})={\bm{x}}_{i}^{(0)}-\sum_{k=1}^{i-1}r_{ki}{\bm{p}}_{k}$$

in the same way as (5) for MGS-HP(row). The algorithms of MGS-HA(row) and MGS-HP(row) are shown in Algorithms 7 and 8, respectively.
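For the row-oriented case, a possible NumPy sketch of the HP-type variant is shown below. It follows the description in the text (one product $AZ$, then rank-one updates of both $Z$ and $AZ$) and is not a transcription of Algorithm 8; the name `mgs_hp_row` is hypothetical.

```python
import numpy as np

def mgs_hp_row(Z, A):
    """Row-oriented MGS in the A-inner product with a single product A @ Z."""
    W = Z.astype(float)       # working copy; column j holds z_j^{(i)} after step i
    X = A @ W                 # X tracks A @ W throughout the loop
    m, n = W.shape
    Q = np.zeros((m, n))
    R = np.zeros((n, n))
    for i in range(n):
        R[i, i] = np.sqrt(W[:, i] @ X[:, i])     # r_ii = ||z_i^{(i-1)}||_A
        Q[:, i] = W[:, i] / R[i, i]
        p = X[:, i] / R[i, i]                    # p_i = A q_i without an MV
        R[i, i + 1:] = p @ W[:, i + 1:]          # r_ij for all j > i at once
        W[:, i + 1:] -= np.outer(Q[:, i], R[i, i + 1:])
        X[:, i + 1:] -= np.outer(p, R[i, i + 1:])
    return Q, R
```

In this form the whole row $r_{i,i+1:n}$ and both trailing updates are matrix-level operations, which is one reason the row-oriented layout is convenient to vectorize.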

The proposed concept can also be applied to CGS to obtain $n$-MV implementations: CGS-HA(col/row) and CGS-HP(col/row). It is also noted that the HP-type row-oriented versions, MGS-HP(row) and CGS-HP(row), are equivalent to the algorithms introduced in [2] for use in the block conjugate gradient method for solving linear systems with multiple right-hand sides. However, the performance of these algorithms is not analyzed or evaluated in [2], because the main objective of [2] is to propose the block conjugate gradient method.

3 Analysis of error bounds

In this section, we present error bounds on the representation error and the loss of $A$ -orthogonality of MGS-HA (Algorithm 5) and show that MGS-HA has nearly the same error bounds as MGS-naive (Algorithm 3).

Let $\alpha\in\mathbb{R},{\bm{x}}\in\mathbb{R}^{m},A\in\mathbb{R}^{m\times n}$ and let $\widehat{\alpha}\in\mathbb{R},\widehat{\bm{x}}\in\mathbb{R}^{m},\widehat{A}\in\mathbb{R}^{m\times n}$ denote their counterparts computed in floating-point arithmetic. Also, we denote by $|A|$ and $|{\bm{x}}|$ the matrix and the vector whose entries are absolute values of entries of $A$ and ${\bm{x}}$ , respectively.

Assuming that $\alpha\in\mathbb{R},{\bm{x}},{\bm{y}}\in\mathbb{R}^{m},A\in\mathbb{R}^{m\times m}$ , we use the following error bounds for scaling ${\bm{y}}=\alpha{\bm{x}}$ , inner product $\alpha={\bm{x}}^{\rm T}{\bm{y}}$ and MV ${\bm{y}}=A{\bm{x}}$ computed in floating-point arithmetic:

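$$\widehat{\bm{y}}=\alpha{\bm{x}}+\Delta{\bm{y}},\quad|\Delta{\bm{y}}|\leq{\bf u}|\alpha||{\bm{x}}|, \tag{6}$$

$$\widehat{\alpha}={\bm{x}}^{\rm T}{\bm{y}}+\Delta\alpha,\quad|\Delta\alpha|\leq\gamma_{m}|{\bm{x}}|^{\rm T}|{\bm{y}}|, \tag{7}$$

$$\widehat{\bm{y}}=A{\bm{x}}+\Delta{\bm{y}},\quad|\Delta{\bm{y}}|\leq\gamma_{m}|A||{\bm{x}}|, \tag{8}$$

where ${\bf u}$ is the unit roundoff and $\gamma_{m}:=m{\bf u}/(1-m{\bf u})\approx m{\bf u}$ [6].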
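To make the constants concrete, the following sketch evaluates $\gamma_{m}$ for IEEE double precision and checks the inner-product bound on one random example against an exact rational reference. The helper name `gamma`, the vector length and the random data are assumptions chosen for illustration only.

```python
import numpy as np
from fractions import Fraction

u = np.finfo(np.float64).eps / 2      # unit roundoff of IEEE double, 2^-53

def gamma(k, u=u):
    """gamma_k = k u / (1 - k u), defined for k u < 1."""
    return k * u / (1 - k * u)

rng = np.random.default_rng(0)
m = 1000
x = rng.standard_normal(m)
y = rng.standard_normal(m)

computed = float(x @ y)               # inner product in floating point
# Exact value of x^T y for the stored doubles, using rational arithmetic.
exact = sum(Fraction(a) * Fraction(b) for a, b in zip(x.tolist(), y.tolist()))
err = abs(Fraction(computed) - exact)
bound = gamma(m) * float(np.abs(x) @ np.abs(y))
```

On typical data `err` is far below `bound`, which reflects that bounds of this type are worst-case estimates.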

3.1 Upper bound of representation error

The recurrence formulas of ${\bm{z}}_{j}^{(i)}$ and ${\bm{q}}_{j}$ in Gram-Schmidt orthogonalization are written as

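$${\bm{z}}_{j}^{(i)}={\bm{z}}_{j}^{(i-1)}-r_{ij}{\bm{q}}_{i}\quad(i=1,2,\dots,j-1), \tag{9}$$

$${\bm{q}}_{j}=\frac{{\bm{z}}_{j}^{(j-1)}}{r_{jj}}. \tag{10}$$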

These formulas are independent of the inner product used. They are also the same whether the naive implementation (MGS-naive, Algorithm 3) or the proposed implementation (MGS-HA, Algorithm 5) is used. The only difference between MGS-naive and MGS-HA lies in how to compute $r_{ij}$ .

In [11, Theorem 3.1], an upper bound on the representation error of MGS-naive in floating-point arithmetic is derived as

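$$\|Z-\widehat{Q}\widehat{R}\|\leq\mathcal{O}(n^{3/2}){\bf u}\,(\|Z\|+\|\widehat{Q}\|\|\widehat{R}\|) \tag{11}$$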

based only on (9) and (10). Because MGS-HA also uses (9) and (10), we have the same upper bound on the representation error of MGS-HA. It is to be noted that the upper bound (11) depends on the computed results $\widehat{Q},\widehat{R}$ , so it is an a posteriori error bound. Hence, (11) means that, if the norms of the computed results $\widehat{Q},\widehat{R}$ are nearly the same for both methods, they have nearly the same upper bounds.

Eqs. (9) and (10) are also the same for CGS-naive and CGS-HA. Therefore, we have the same upper bound of the representation error for CGS-naive and CGS-HA.

3.2 Upper bound of loss of $A$ -orthogonality

The main difference between MGS-naive and MGS-HA lies in how to compute $r_{ij}$ for the strict upper triangular part ($i<j$), because both methods compute the diagonal element by $r_{jj}=\sqrt{({\bm{z}}_{j}^{(j-1)})^{\rm T}A{\bm{z}}_{j}^{(j-1)}}$.

MGS-naive computes $r_{ij}$ from ${\bm{q}}_{i}$ and ${\bm{z}}_{j}^{(i-1)}$ $(i<j)$ by

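$${\bm{p}}_{i}=A{\bm{q}}_{i}, \tag{12}$$

$$r_{ij}={\bm{p}}_{i}^{\rm T}{\bm{z}}_{j}^{(i-1)}. \tag{13}$$

In contrast, MGS-HA computes $r_{ij}$ from the unnormalized vector ${\bm{z}}_{i}^{(i-1)}$ by

$${\bm{x}}_{i}^{(i-1)}=A{\bm{z}}_{i}^{(i-1)}, \tag{14}$$

$${\bm{p}}_{i}=\frac{{\bm{x}}_{i}^{(i-1)}}{r_{ii}}, \tag{15}$$

$$r_{ij}={\bm{p}}_{i}^{\rm T}{\bm{z}}_{j}^{(i-1)}. \tag{16}$$

On the other hand, the vector ${\bm{q}}_{i}$ is computed by normalization of ${\bm{z}}_{i}^{(i-1)}$ as in MGS-naive, i.e.,

$${\bm{q}}_{i}=\frac{{\bm{z}}_{i}^{(i-1)}}{r_{ii}}.$$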

According to [11, Theorem 3.2], the local errors of the $A$-inner product, AXPY (9) and scaling (10) are propagated through $\widehat{R}^{-1}$ into the loss of $A$-orthogonality of MGS-naive, $\widehat{Q}^{\rm T}A\widehat{Q}-I_{n}$. Eqs. (9) and (10) are the same for both methods, and the same bound on the norm of $\widehat{R}^{-1}$ applies to both. Therefore, we only need to analyze the local error of the $A$-inner product.

From the error bounds of MV and inner product, (8) and (7), Eqs. (12) and (13) in floating-point arithmetic can be written as

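$$\widehat{\bm{p}}_{i}=A\widehat{\bm{q}}_{i}+\Delta{\bm{p}}_{i},\quad|\Delta{\bm{p}}_{i}|\leq\gamma_{m}|A||\widehat{\bm{q}}_{i}|, \tag{17}$$

$$\widehat{r}_{ij}=\widehat{\bm{p}}_{i}^{\rm T}\widehat{\bm{z}}_{j}^{(i-1)}+\Delta r_{ij},\quad|\Delta r_{ij}|\leq\gamma_{m}|\widehat{\bm{p}}_{i}|^{\rm T}|\widehat{\bm{z}}_{j}^{(i-1)}|. \tag{18}$$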

From (17) and (18), an error bound of $\widehat{r}_{ij}$ computed by MGS-naive, ignoring terms of $\mathcal{O}({\bf u}^{2})$ , is derived [11] as

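$$\begin{aligned}|\widehat{r}_{ij}-\widehat{\bm{q}}_{i}^{\rm T}A\widehat{\bm{z}}_{j}^{(i-1)}|&\leq|(\Delta{\bm{p}}_{i})^{\rm T}\widehat{\bm{z}}_{j}^{(i-1)}|+|\Delta r_{ij}|\\&\leq\|\Delta{\bm{p}}_{i}\|\|\widehat{\bm{z}}_{j}^{(i-1)}\|+\gamma_{m}\|\widehat{\bm{p}}_{i}\|\|\widehat{\bm{z}}_{j}^{(i-1)}\|\\&\leq\gamma_{m}\|\,|A|\,\|\|\widehat{\bm{q}}_{i}\|\|\widehat{\bm{z}}_{j}^{(i-1)}\|+\gamma_{m}\|A\|\|\widehat{\bm{q}}_{i}\|\|\widehat{\bm{z}}_{j}^{(i-1)}\|\\&\leq\gamma_{m\sqrt{m}+m}\|A\|\|\widehat{\bm{q}}_{i}\|\|\widehat{\bm{z}}_{j}^{(i-1)}\|,\end{aligned} \tag{19}$$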

where we used $\|\,|A|\,\|\leq\|A\|_{\rm F}\leq\sqrt{m}\|A\|$ , $\ell\gamma_{k}\leq\gamma_{\ell k}$ and $\gamma_{k}+\gamma_{\ell}\leq\gamma_{k+\ell}$ [6].

In contrast, formulas (14)–(16) of MGS-HA in floating-point arithmetic become

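$$\widehat{\bm{x}}_{i}^{(i-1)}=A\widehat{\bm{z}}_{i}^{(i-1)}+\Delta{\bm{x}}_{i}^{(i-1)},\quad|\Delta{\bm{x}}_{i}^{(i-1)}|\leq\gamma_{m}|A||\widehat{\bm{z}}_{i}^{(i-1)}|, \tag{20}$$

$$\widehat{\bm{p}}_{i}=\frac{\widehat{\bm{x}}_{i}^{(i-1)}}{\widehat{r}_{ii}}+\Delta{\bm{p}}_{i},\quad|\Delta{\bm{p}}_{i}|\leq{\bf u}\frac{|\widehat{\bm{x}}_{i}^{(i-1)}|}{|\widehat{r}_{ii}|}, \tag{21}$$

$$\widehat{r}_{ij}=\widehat{\bm{p}}_{i}^{\rm T}\widehat{\bm{z}}_{j}^{(i-1)}+\Delta r_{ij},\quad|\Delta r_{ij}|\leq\gamma_{m}|\widehat{\bm{p}}_{i}|^{\rm T}|\widehat{\bm{z}}_{j}^{(i-1)}|. \tag{22}$$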

These formulas compute $r_{ij}$ from $\widehat{\bm{z}}_{i}^{(i-1)}$ and $\widehat{\bm{z}}_{j}^{(i-1)}$ . Because the local error of $A$ -inner product is defined as the difference between $\widehat{r}_{ij}$ and $\widehat{\bm{q}}_{i}^{\rm T}A\widehat{\bm{z}}_{j}^{(i-1)}$ , we also need a relationship between $\widehat{\bm{q}}_{i}$ and $\widehat{\bm{z}}_{i}^{(i-1)}$ , i.e.,

[TABLE]

Substituting (22), (21), (20) and (23) into $|\widehat{r}_{ij}-\widehat{\bm{q}}_{i}^{\rm T}A\widehat{\bm{z}}_{j}^{(i-1)}|$ in this order and ignoring terms of $\mathcal{O}({\bf u}^{2})$ , we have

[TABLE]

Comparing (19) for MGS-naive and (24) for MGS-HA, we know that the only difference is in the coefficients:

[TABLE]

In [11], it is shown that the strict upper triangular part of the loss of $A$ -orthogonality $\widehat{Q}^{\rm T}A\widehat{Q}-I_{n}$ , which is denoted as $\Delta E^{(3)}$ , can be bounded as

[TABLE]

where $\Delta E^{(2)}$ is a strict upper triangular matrix defined by

[TABLE]

and $\Delta{\bm{d}}_{j}^{(i)}$ ($i<j$) and $\Delta{\bm{d}}_{j}^{(j)}$ are floating-point errors arising in the AXPY operation (9) and scaling (10), respectively. See the proof of Theorem 3.2 in [11] for details. (Footnote 1: The definition of $[\Delta E^{(2)}]_{ij}$ given in [11] is actually the definition of $[\Delta E^{(2)}]_{ji}$. We corrected this in Eq. (26).)

In Eqs. (25)–(27), the AXPY error $\Delta{\bm{d}}_{j}^{(i)}$ ( $i<j$ ) and the scaling errors $\Delta{\bm{d}}_{j}^{(j)}$ and $\|\widehat{\bm{q}}_{i}\|_{A}^{2}-1$ can be bounded by the same expression in both methods, because their computational formulas are the same. The norm $\|\widehat{R}^{-1}\|$ can also be bounded in the same way in both methods. Hence, the only difference lies in the evaluation of the local error of $\widehat{r}_{ij}$ , defined as $\widehat{r}_{ij}-\widehat{\bm{q}}_{i}^{\rm T}A\widehat{\bm{z}}_{j}^{(i-1)}$ . But comparing (19) with (24) reveals that the difference in this part is slight. In addition, the diagonal part of $\widehat{Q}^{\rm T}A\widehat{Q}-I_{n}$ is nothing but the scaling error and has the same bound for both methods. Thus we can conclude that MGS-HA has the same a posteriori bound for the loss of $A$ -orthogonality as MGS-naive [11]:

[TABLE]

provided that $\mathcal{O}(m^{3/2}){\bf u}\kappa(A)\kappa(A^{1/2}Z)<1$ .

3.3 Analysis of CGS

For a variant of CGS, CGS-P [12], that computes the diagonal element $r_{jj}$ in a different way from the original CGS, error bounds for a non-standard inner product are given in [11]. On the other hand, error bounds of original CGS have not been well analyzed yet for a non-standard inner product.

However, we can estimate the influence of the proposed approach on the error bounds of CGS. As in the case of MGS, the only difference between CGS-naive and CGS-HA is how to compute $r_{ij}$ $(i<j)$ . For both CGS-naive and CGS-HA, the recurrence formulas are obtained from those of MGS-naive and MGS-HA, respectively, by changing $\widehat{\bm{z}}_{j}^{(i-1)}$ to $\widehat{\bm{z}}_{j}^{(0)}$ . Therefore, the local error in the computation of $\widehat{r}_{ij}$ can be evaluated by (19) and (24) by changing $\widehat{\bm{z}}_{j}^{(i-1)}$ to $\widehat{\bm{z}}_{j}^{(0)}$ . Thus, the local errors of $\widehat{r}_{ij}$ are nearly the same for both CGS-naive and CGS-HA. As a result, we can expect that CGS-HA has nearly the same loss of $A$ -orthogonality as CGS-naive.

4 Numerical experiments

In this section, we evaluate the computational performance of MGS-HA (Algorithm 5) and MGS-HP (Algorithm 6). In particular, we compare the computation time and the loss of $A$ -orthogonality of these methods with those of MGS-naive (Algorithm 3), CGS-naive and Cholesky QR, the last of which is one of the fastest algorithms for (1).
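For reference, the following sketch shows the usual textbook form of Cholesky QR in the $A$-inner product together with a measure of the loss of $A$-orthogonality. Both are assumptions made for illustration: the paper does not list its Cholesky QR code in this section, and the Frobenius norm used in `loss_of_a_orthogonality` is a choice made here, not necessarily the norm used in the experiments.

```python
import numpy as np

def cholesky_qr_a(Z, A):
    """Cholesky QR with respect to the A-inner product (textbook form)."""
    G = Z.T @ (A @ Z)                  # Gram matrix Z^T A Z, n x n
    L = np.linalg.cholesky(G)          # G = L L^T with L lower triangular
    R = L.T                            # upper triangular factor
    Q = np.linalg.solve(L, Z.T).T      # Q = Z R^{-1}, from L Q^T = Z^T
    return Q, R

def loss_of_a_orthogonality(Q, A):
    """Frobenius norm of Q^T A Q - I (the choice of norm is an assumption)."""
    n = Q.shape[1]
    return np.linalg.norm(Q.T @ (A @ Q) - np.eye(n), "fro")
```

Cholesky QR needs a single product $AZ$ and otherwise only matrix-level operations, which is consistent with it serving as the speed baseline in the comparison.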

4.1 Numerical experiment I

First, we compare the computation time of MGS-naive, MGS-HA, MGS-HP and Cholesky QR for two different problems. For the first problem, $A$ is a random dense spd matrix with $m=10000$. For the second problem, $A$ is the sparse spd matrix AUNW9180 obtained from the ELSES matrix library [3]. This is an overlap matrix in an electronic structure calculation of a helical multishell gold nanowire. The size of the matrix is $m=9180$ and the number of non-zero entries is $nnz=3557446$. For both problems, we set $Z$ to be a random dense matrix. We test $n=5,10,\dots,100,200,\dots,2000$.

All the numerical experiments were carried out in double precision arithmetic on OS: CentOS 64bit, CPU: Intel Xeon CPU E5-2667 3.20GHz (1 core), Memory: 48GB. We used Intel MKL for matrix computations and Mersenne twister for generating random matrices.

Figure 1 shows the computation time for both problems, while Figure 2 shows the breakdown of the computation time, scaled by the total computation time of MGS-naive for each $n$. When $n\ll m$, most of the computation time is spent on MV and hence the total time increases proportionally to $n$; see the left columns of Figure 1 and Figure 2. In this situation, MGS-HA achieves a 2x speedup over MGS-naive. MGS-HP and Cholesky QR are faster still than both MGS-naive and MGS-HA. On the other hand, as $n$ becomes larger, the share of computation time spent on the other parts increases, especially for the sparse problem. In this situation, the speedup ratio of the proposed methods becomes relatively small, although both methods are still faster than MGS-naive.

4.2 Numerical experiment II

Next, we compare the loss of $A$ -orthogonality

[TABLE]

of MGS-naive, MGS-HA, MGS-HP, CGS-naive and Cholesky QR. Let $V\in\mathbb{R}^{m\times m}$ be a random orthogonal matrix. Then, we set $A$ as

[TABLE]

where

[TABLE]

so that $\log_{10}d_{i}$ are evenly spaced. Also, let $W\in\mathbb{R}^{n\times n}$ be a random orthogonal matrix and $U_{1},U_{2}\in\mathbb{R}^{m\times n}$ be matrices whose columns are eigenvectors of $A$ corresponding to the $n$ largest and the $n$ smallest eigenvalues, respectively. Then, we set $Z$ as

[TABLE]

where

[TABLE]

Case 1 and case 2 provide a best case and a worst case with respect to the loss of $A$ -orthogonality, respectively [10]. We set $m=100,n=20$ and test $28^{2}$ problems with $\kappa(A),\kappa(A^{1/2}Z)=10^{0.5},10^{1},10^{1.5},\dots,10^{14}$ for each case.
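The construction of $A$ described above (a random orthogonal $V$ and eigenvalues $d_{i}$ with evenly spaced $\log_{10}d_{i}$) can be sketched as follows. The scaling of the eigenvalues between $1/\kappa(A)$ and $1$, the use of a QR factorization of a Gaussian matrix to obtain $V$, and the function names are assumptions; the exact formulas for $A$ and $Z$ are not reproduced here.

```python
import numpy as np

def random_orthogonal(m, rng):
    """Random orthogonal matrix from the QR factorization of a Gaussian matrix."""
    V, _ = np.linalg.qr(rng.standard_normal((m, m)))
    return V

def spd_with_condition(m, kappa, rng):
    """spd matrix V diag(d) V^T whose eigenvalues are log-evenly spaced."""
    V = random_orthogonal(m, rng)
    d = np.logspace(0.0, -np.log10(kappa), m)   # d_1 = 1, ..., d_m = 1/kappa
    A = (V * d) @ V.T                           # same as V @ diag(d) @ V.T
    return (A + A.T) / 2, V, d                  # symmetrize against rounding

rng = np.random.default_rng(0)
A, V, d = spd_with_condition(100, 1e4, rng)
```

With $m=100$ as in the experiments, sweeping `kappa` over $10^{0.5},10^{1},\dots,10^{14}$ would give the $\kappa(A)$ axis of the test grid under these assumptions.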

All the numerical experiments were carried out in MATLAB2016a. We used Mersenne twister for generating random matrices.

We present log10 of the loss of $A$ -orthogonality as a function of $\kappa(A)$ and $\kappa(A^{1/2}Z)$ for case 1 and case 2 in Figures 3 and 4, respectively. In case 1 (the best case), the loss of $A$ -orthogonality of all methods depends only on $\kappa(A^{1/2}Z)$ ; in contrast, it depends on both $\kappa(A^{1/2}Z)$ and $\kappa(A)$ in case 2 (the worst case). In both cases, MGS-naive, MGS-HA and MGS-HP show better accuracy than CGS-naive and Cholesky QR: the dependence on $\kappa(A^{1/2}Z)$ is linear for the former and quadratic for the latter. Here, we note that Cholesky QR failed when $\kappa(A^{1/2}Z)\geq 10^{8}$ .

Next, we compare the proposed implementations, MGS-HA and MGS-HP, with MGS-naive. MGS-HP shows nearly the same accuracy as MGS-naive in both cases and MGS-HA shows nearly the same accuracy as MGS-naive in case 1. In addition, as a remarkable result, we observe that MGS-HA shows better accuracy than MGS-naive in case 2, especially when both $A$ and $Z$ are ill-conditioned: $\kappa(A),\kappa(A^{1/2}Z)\gg 1$ ; see Figure 4(b).

4.3 Numerical experiment III

Here, we compare the loss of $A$-orthogonality of the computed results of MGS-naive, MGS-HA and MGS-HP in case 2 with the theoretical error bound (28) derived in Section 3. As shown in Section 3, the loss of $A$-orthogonality of MGS-naive and MGS-HA is bounded by

[TABLE]

Figure 5 shows the upper bound $\delta_{1}$ as a function of $\kappa(A)$ and $\kappa(A^{1/2}Z)$, while Figure 6 plots the actual loss of $A$-orthogonality against $\delta_{1}$. Comparing Figures 4(a), (c) and 5 reveals that $\delta_{1}$ represents the actual loss of $A$-orthogonality well for MGS-naive and MGS-HP. In fact, we can see from Figure 6(a), (c) that $\delta_{1}$ is not only an upper bound, but also a good estimate of the actual loss of $A$-orthogonality for MGS-naive and MGS-HP. For MGS-HA, however, there are many computational results for which the loss of $A$-orthogonality is much lower than suggested by $\delta_{1}$; see Figure 6(b). This indicates that although $\delta_{1}$ is certainly an upper bound for MGS-HA, it may not be a sharp upper bound.

Instead of $\delta_{1}$ , let us consider the following quantity:

[TABLE]

Figure 7 shows $\delta_{2}$ as a function of $\kappa(A)$ and $\kappa(A^{1/2}Z)$, while Figure 8 plots the actual loss of $A$-orthogonality against $\delta_{2}$. Comparing Figures 4(b) and 7, we can see that $\delta_{2}$ represents the computational results of MGS-HA well. We also observe, from Figure 8(b), that $\delta_{2}$ is a very sharp upper bound for MGS-HA. It is to be noted that $\delta_{2}$ does not serve as an upper bound of the loss of $A$-orthogonality for MGS-naive and MGS-HP; see Figure 8(a), (c).

Although the quantity $\delta_{2}$ we introduced here has no theoretical background yet, the numerical results suggest that it can describe the actual loss of $A$ -orthogonality for MGS-HA very well. Based on this fact, we make the following conjecture on a sharper upper bound for the loss of $A$ -orthogonality of MGS-HA.

Conjecture 1.

The loss of $A$ -orthogonality of MGS-HA can be bounded as

[TABLE]

5 Conclusions

In this paper, we propose two types of efficient implementations of the modified Gram-Schmidt orthogonalization with a non-standard inner product. These methods, named MGS-HA and MGS-HP, require only $n$ MV, in contrast to the naive implementation, MGS-naive, that requires $2n$ MV. Experimental results show that both methods are much faster than MGS-naive. Specifically, MGS-HP is nearly as fast as Cholesky QR for small $n$ . Regarding accuracy, we prove that MGS-HA has nearly the same error bounds for representation error and loss of $A$ -orthogonality as MGS-naive. According to the numerical experiments, MGS-HP shows nearly the same accuracy and MGS-HA shows higher accuracy than MGS-naive. We also introduce a conjecture on a sharper upper bound for the loss of $A$ -orthogonality for MGS-HA (Conjecture 1).

In the future, we expect to prove Conjecture 1 and also derive an upper bound for MGS-HP. We also plan to evaluate the computational performance of the proposed implementations for large problems in parallel environments.

Acknowledgment

The present study is supported in part by Japan Science and Technology Agency, ACT-I (No. JPMJPR16U6) and the Japanese Ministry of Education, Culture, Sports, Science and Technology, Grant-in-Aid for Scientific Research (Nos. 26286087, 15H02708, 15H02709, 16KT0016).

Bibliography

[1] Å. Björck, Numerical Methods for Least Squares Problems, SIAM, 1996.
[2] A. A. Dubrulle, Retooling the method of block conjugate gradients, ETNA, 12 (2001), 216–233.
[3] ELSES matrix library, http://www.elses.jp/matrix/ .
[4] A. Essai, Weighted FOM and GMRES for solving nonsymmetric linear systems, Numer. Alg., 18 (1998), 277–292.
[5] M. Gulliksson, On the modified Gram-Schmidt algorithm for weighted and constrained linear least squares problems, BIT Numerical Mathematics, 35 (1995), 453–468.
[6] N. J. Higham, Accuracy and Stability of Numerical Algorithms, 2nd ed., SIAM, 2002.
[7] A. Imakura, L. Du, H. Tadano, A Weighted Block GMRES method for solving linear systems with multiple right-hand sides, JSIAM Letters, 5 (2013), 65–68.
[8] A. Imakura, T. Sakurai, Block Krylov-type complex moment-based eigensolvers for solving generalized eigenvalue problems, Numer. Alg., (accepted).
