A Parallel Hierarchical Blocked Adaptive Cross Approximation Algorithm

Yang Liu; Wissam Sid-Lakhdar; Elizaveta Rebrova; Pieter Ghysels; Xiaoye Sherry Li

arXiv:1901.06101·math.NA·September 6, 2019·Int. J. High Perform. Comput. Appl.

TL;DR

This paper introduces a hierarchical blocked adaptive cross approximation algorithm that enhances low-rank matrix decompositions with improved convergence and efficiency, suitable for parallel computing environments.

Contribution

It proposes a novel hierarchical BACA algorithm that combines adaptive cross approximation with hierarchical merging, improving convergence and computational efficiency over traditional methods.

Findings

01

Significantly improved convergence over baseline ACA

02

Reduced computational complexity compared to full decompositions

03

Demonstrated efficiency and parallel scalability in numerical tests

Abstract

This paper presents a hierarchical low-rank decomposition algorithm assuming any matrix element can be computed in $O (1)$ time. The proposed algorithm computes rank-revealing decompositions of sub-matrices with a blocked adaptive cross approximation (BACA) algorithm, followed by a hierarchical merge operation via truncated singular value decompositions (H-BACA). The proposed algorithm significantly improves the convergence of the baseline ACA algorithm and achieves reduced computational complexity compared to the full decompositions such as rank-revealing QR decompositions. Numerical results demonstrate the efficiency, accuracy and parallel efficiency of the proposed algorithm.

Tables

Table 1: Flop counts and communication costs for the leaf-level compression and hierarchical merge operations in Algorithm 3 for two classes of low-rank matrices. $n$ and $r$ denote matrix dimension and rank. $d$ denotes the block size in BACA. $p$ and $n_{b}$ denote the number of processes and leaf-level submatrices. $s_{l}$ denotes the maximum rank among all level-$l$ submatrices.

Operation	Constant rank ($s_{l} \approx r$)	Increasing rank ($s_{l} \approx 2^{l} r / \sqrt{n_{b}}$)
BACA ($d \leq s_{0}$)	$O (n r^{2} \sqrt{n_{b}})$	$O (n r^{2} / \sqrt{n_{b}})$
Merge compute	$O (n r^{2} \sqrt{n_{b}})$	$O (n r^{2})$
Merge communicate	$[O (r \log^{2} p), O (n r \log^{2} p / \sqrt{p})]$	$[O (r \log p), O (n r \log p / \sqrt{p})]$

Table 2: Comparisons between the proposed BACA and H-BACA algorithms and existing ACA algorithms. Note that the algorithms show increasing robustness from left to right.

Algorithm	ACA/ACA+	Hybrid-ACA	BACA	H-BACA
Pivot count per iteration	1	1	$d$	$n_{b} d$
Cost (constant rank)	$O (n r^{2})$	$O (n r^{2})$	$O (n r^{2})$	$O (n r^{2} \sqrt{n_{b}})$
Cost (increasing rank)	$O (n r^{2})$	$O (n r^{2})$	$O (n r^{2})$	$O (n r^{2})$
Pre-selection of submatrices	no	yes	no	no

Equations

$A \approx UV = \sum_{k=1}^{r} u_{k} v_{k}^{t}$

$i_{k} = \operatorname*{arg\,max}_{i \neq i_{1},...,i_{k-1}} \lvert E_{k-1}(:,j_{k}) \rvert$

$j_{k+1} = \operatorname*{arg\,max}_{j \neq j_{1},...,j_{k}} \lvert E_{k-1}(i_{k},:) \rvert$

$\nu = \lVert u_{k} v_{k}^{t} \rVert_{F} \approx \lVert A - UV \rVert_{F}, \quad \mu = \lVert UV \rVert_{F} \approx \lVert A \rVert_{F}$

$A \approx UV = \sum_{k=1}^{n_{d}} U_{k} V_{k}$

$[Q_{k}^{c}, T_{k}^{c}, I_{k}]$

$[Q_{k+1}^{r}, T_{k+1}^{r}, J_{k+1}]$

$[Q, T, \bar{J}] = \mathtt{QR}(W_{k}, \epsilon)$ with $Q \in \mathbb{R}^{d \times d_{k}}$

$U_{k} = C_{k}(:, \bar{J}), \quad V_{k} = T^{-1} Q^{t} R_{k}$

$I_{k} \leftarrow I_{k}([1,d_{k}]), \quad J_{k} \leftarrow J_{k}(\bar{J})$

$T_{U_{k}}$

$\nu$

$\mu^{2} \leftarrow \mu^{2} + \nu^{2} + 2 \sum_{i=1}^{r_{k-1}} \sum_{j=1}^{d_{k}} \tilde{V}(i,j)$

$\tilde{V} = (V V_{k}^{t}) \circ (U^{t} U_{k})$

$\bar{U}_{\tau_{i}\nu} = [U_{\tau_{i}\nu_{1}} \Sigma_{\tau_{i}\nu_{1}}, U_{\tau_{i}\nu_{2}} \Sigma_{\tau_{i}\nu_{2}}], \quad \bar{V}_{\tau_{i}\nu} = \mathrm{diag}(V_{\tau_{i}\nu_{1}}, V_{\tau_{i}\nu_{2}})$

$[U_{\tau_{i}\nu}, \Sigma_{\tau_{i}\nu}, V_{\tau_{i}\nu}, r_{\tau_{i}\nu}] \leftarrow \mathtt{SVD}(\bar{U}_{\tau_{i}\nu}, \epsilon), \quad V_{\tau_{i}\nu} \leftarrow V_{\tau_{i}\nu} \bar{V}_{\tau_{i}\nu}$

$c_{BACA} = \sum_{k=1}^{O(\lceil r/d \rceil)} (n d^{2} + n r_{k} d + d_{k} d^{2})$

$\leq O(n d^{2} + r d^{2} + n r d) \, O(\lceil r/d \rceil) = O(n r^{2})$

$c_{b}$

$c_{m} = \sum_{l=1}^{L} O(4^{L-l} n_{l} s_{l}^{2})$

$v_{m} = \sum_{l=1}^{L} \Big[ O(s_{l} \log p_{l}), O\Big( \frac{n_{l} s_{l} \log p_{l}}{\sqrt{p_{l}}} \Big) \Big]$

$= \sum_{l=1}^{L} \Big[ O(l s_{l}), O\Big( \frac{l n s_{l}}{\sqrt{p}} \Big) \Big]$

Full text

Yang Liu (1), Wissam Sid-Lakhdar (1), Elizaveta Rebrova (2), Pieter Ghysels (1) and Xiaoye Sherry Li (1)

(1) Computational Research Division, Lawrence Berkeley National Laboratory, Berkeley, CA, USA

(2) Department of Mathematics, University of California, Los Angeles, CA, USA

Corresponding author: Yang Liu, Computational Research Division, Lawrence Berkeley National Laboratory, Berkeley, CA, USA.

Abstract

This paper presents a low-rank decomposition algorithm assuming any matrix element can be computed in $O(1)$ time. The proposed algorithm first computes rank-revealing decompositions of sub-matrices with a blocked adaptive cross approximation (BACA) algorithm, and then applies a hierarchical merge operation via truncated singular value decompositions (H-BACA). The proposed algorithm significantly improves the convergence of the baseline ACA algorithm and achieves reduced computational complexity compared to the full decompositions such as rank-revealing QR. Numerical results demonstrate the efficiency, accuracy and parallel scalability of the proposed algorithm.

keywords:

Adaptive cross approximation, singular value decomposition, rank-revealing decomposition, parallelization, multi-level algorithms

1 Introduction

Rank-revealing decomposition algorithms are important numerical linear algebra tools for compressing high-dimensional data, accelerating the solution of integral and partial differential equations, constructing efficient machine learning algorithms, and analyzing numerical algorithms, as matrices arising from many science and engineering applications often exhibit numerical rank-deficiency. Despite the favorable $O(nr)$ memory footprint of such decompositions, with $n$ and $r$ respectively denoting the matrix dimension (assuming a square matrix) and the numerical rank, the computational cost can be high. Existing rank-revealing decompositions such as truncated singular value decomposition (SVD), column-pivoted QR (QRCP), CUR decomposition, interpolative decomposition (ID), and rank-revealing LU typically require at least $O(n^{2}r)$ operations Gu and Eisenstat (1996); Cheng et al. (2005); Voronin and Martinsson (2017); Mahoney and Drineas (2009). This complexity can be reduced to $O(n^{2}\mathrm{log}\>r+nr^{2})$ by structured random matrix projection-based algorithms Voronin and Martinsson (2017); Liberty et al. (2007). In addition, faster algorithms are available in the following three scenarios.

1. When each matrix entry can be computed in $O(1)$ CPU time with prior knowledge (i.e., smoothness, sparsity, or leverage scores) about the matrix, faster algorithms such as randomized CUR and adaptive cross approximation (ACA) Bebendorf (2000); Bebendorf and Grzhibovskis (2006); Zhao et al. (2005) can achieve $O(nr^{2})$ complexity. However, the robustness of these algorithms relies heavily on matrix properties that are not always present in practice.

2. When the matrix can be rapidly applied to arbitrary vectors, algorithms such as randomized SVD, QR and UTV (T lower or upper triangular) Liberty et al. (2007); Xiao et al. (2017); Feng et al. (2018a); Martinsson et al. (2017) can be utilized to achieve quasi-linear complexity.

3. Finally, given a matrix with missing entries, the low-rank decomposition can be constructed via matrix completion algorithms Candès and Recht (2009); Balzano et al. (2010) in quasi-linear time assuming incoherence properties of the matrices (i.e., projection of natural basis vectors onto the space spanned by singular vectors of the matrix should not be very sparse).

This work concerns the development of a practical algorithm, in application scenario 1, that improves the robustness of ACA algorithms while maintaining reduced complexity for broad classes of matrices.

The partially-pivoted ACA algorithm, closely related to LU with rook pivoting Foster (1997), constructs an LU-type decomposition upon accessing one row and column per iteration. For matrices resulting from asymptotically smooth kernels, ACA is a rank-revealing and optimal-complexity algorithm that converges in $O(k)$ iterations Bebendorf (2000). Despite its favorable computational complexity, it is well known that the ACA algorithm suffers from deteriorated convergence and/or premature termination for non-smooth, sparse and/or coherent matrices Heldring et al. (2014). Hybrid methods or improved convergence criteria (e.g., hybrid ACA-CUR, averaging, statistical norm estimation) have been proposed to partially alleviate the problem Heldring et al. (2015); Grasedyck and Hackbusch (2005). The main difficulty of leveraging ACA as a robust algebraic tool for general low-rank matrices stems from the partial pivot-search strategy that ACA uses to attain low complexity. In addition to the above-mentioned remedies, another possibility to improve ACA's robustness is to search for pivots in a wider range of rows/columns without sacrificing too much computational efficiency. Here we consider two different strategies: 1. Instead of searching one row/column per iteration as in ACA, it is possible to search a block of rows/columns to find multiple pivots together. 2. Instead of applying ACA directly on the entire matrix, it is possible to start with compressing submatrices via ACA and then merge the results as one low-rank product. In extreme cases (e.g., when block size equals matrix dimension or submatrix dimension equals one), these strategies lead to quadratic computational costs. Therefore, it is valuable to address the question: for what matrix kernels and under what block/submatrix sizes will these strategies retain low complexity?

For the first strategy, this work proposes a blocked ACA algorithm (BACA) that extracts a block row/column per iteration to significantly improve convergence of the baseline ACA algorithms. The blocked version also enjoys higher flop performance as it involves mainly BLAS-3 operations. Compared to the aforementioned remedies, the proposed algorithm provides a unified framework to balance robustness and efficiency. Upon increasing the block size (i.e., the number of rows/columns per iteration), the algorithm gradually changes from ACA to ID. For the second strategy, the proposed algorithm further subdivides the matrix into $n_{b}$ submatrices compressed via BACA, followed by a hierarchical merge algorithm leveraging low-rank arithmetic Hackbusch et al. (2002); Grasedyck and Hackbusch (2003). The overall cost of this H-BACA algorithm is at most $O(\sqrt{n_{b}}nr^{2})$ assuming the block size in BACA is less than the rank. In other words, the proposed H-BACA algorithm is a general numerical linear algebra tool as an alternative to ACA, SVD, QR, etc. In addition, the overall algorithm can be parallelized using distributed-memory linear algebra packages such as ScaLAPACK Blackford et al. (1997) which avoids the difficulty of efficient parallelization of plain ACA algorithms. Numerical results illustrate good accuracy, efficiency and parallel performance. In addition, the proposed algorithm can be used as a general low-rank compression tool for constructing hierarchical matrices Rebrova et al. (2018).

2 Notation

Throughout this paper, we adopt the Matlab notation of matrices and vectors. Submatrices of a matrix $A$ are denoted $A(I,J)$ , $A(:,J)$ or $A(I,:)$ where $I$ , $J$ are index sets. Similarly, subvectors of a column vector $u$ are denoted $u(I)$ . An index set $I$ permuted by $J$ reads $I(J)$ . Transpose, inverse, pseudo-inverse of $A$ are $A^{t}$ , $A^{-1}$ , $A^{\dagger}$ . $\left\lVert A\right\rVert_{F}$ and $\left\lVert u\right\rVert_{2}$ denote Frobenius norm and 2-norm. Note that $u$ refers to a $n\times 1$ column vector. Vertical and horizontal concatenations of $A$ , $B$ are $[A;B]$ and $[A,B]$ . Element-wise multiplication of $A$ and $B$ is $A\circ B$ . All matrices are real-valued unless otherwise stated. It is assumed for $A\in\mathbb{R}^{m\times n}$ , $m=O(n)$ , but the proposed algorithms also apply to complex-valued and tall-skinny / short-fat matrices. We denote truncated SVD as $[U,\Sigma,V,r]=\mathtt{SVD}(A,\epsilon)$ with $U\in\mathbb{R}^{m\times r}$ , $V^{t}\in\mathbb{R}^{n\times r}$ column orthogonal, $\Sigma\in\mathbb{R}^{r\times r}$ diagonal, and $r$ being $\epsilon$ -rank defined by $r=\min\{k\in\mathbb{N}:\Sigma_{k+1,k+1}<\epsilon\Sigma_{1,1}\}$ . We denote QRCP as $[Q,T,J]=\mathtt{QR}(A,r)$ or $[Q,T,J]=\mathtt{QR}(A,\epsilon)$ with $Q\in\mathbb{R}^{m\times r}$ column orthogonal, $T\in\mathbb{R}^{r\times n}$ upper triangular, $J$ being column pivots, and $\epsilon$ and $r$ being the prescribed accuracy and rank, respectively. QR without column-pivoting is simply written as $[Q,T]=\mathtt{QR}(A)$ . Cholesky decomposition without pivoting is written as $T=\mathtt{Chol}(A)$ with $T$ upper triangular. $\mathrm{log}n$ means logarithm of $n$ to the base 2.
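The two truncation operators defined above can be sketched in a few lines of NumPy/SciPy. This is only an illustration of the notation: the function names `truncated_svd` and `truncated_qrcp` are hypothetical, and the rule used to truncate QRCP at tolerance $\epsilon$ (comparing diagonal entries of $T$ against the first one) is an assumption, since the text does not spell it out.

```python
import numpy as np
from scipy.linalg import qr


def truncated_svd(A, eps):
    """Return U, sigma, V, r with A ~= U @ diag(sigma) @ V and r the eps-rank."""
    U, sigma, V = np.linalg.svd(A, full_matrices=False)
    if sigma.size == 0 or sigma[0] == 0.0:
        r = 0
    else:
        # r = min{k : sigma_{k+1} < eps * sigma_1} equals the number of
        # singular values that are at least eps * sigma_1.
        r = int(np.sum(sigma >= eps * sigma[0]))
    return U[:, :r], sigma[:r], V[:r, :], r


def truncated_qrcp(A, eps):
    """Column-pivoted QR, A[:, J] ~= Q @ T, truncated by an assumed diagonal rule."""
    Q, T, J = qr(A, mode="economic", pivoting=True)
    diag = np.abs(np.diag(T))
    if diag.size == 0 or diag[0] == 0.0:
        r = 0
    else:
        # Assumption: keep pivots whose |T[k, k]| is at least eps * |T[0, 0]|.
        r = int(np.sum(diag >= eps * diag[0]))
    return Q[:, :r], T[:r, :], J, r
```

Here `V` is returned as an $r\times n$ matrix with orthonormal rows, matching the convention that $V^{t}$ is column orthogonal.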

3 Algorithm Description

3.1 Adaptive Cross Approximation

Before describing the proposed algorithm, we first briefly summarize the baseline ACA algorithm Zhao et al. (2005). Consider a matrix $A\in\mathbb{R}^{m\times n}$ of $\epsilon$ -rank $r$ , the ACA algorithm approximates $A$ by a sequence of rank-1 outer-products as

$A\approx UV=\sum_{k=1}^{r}u_{k}v_{k}^{t}$ (1)

At each iteration $k$ , the algorithm selects column $u_{k}$ (pivot $j_{k}$ from remaining columns) and row $v_{k}^{t}$ (pivot $i_{k}$ from remaining rows) from the residual matrix $E_{k-1}=A-\sum_{i=1}^{k-1}u_{i}v_{i}^{t}$ corresponding to an element denoted by $E_{k-1}(i_{k},j_{k})$ with sufficiently large magnitude. Note that $u_{k}$ and $v_{k}$ are $m\times 1$ and $n\times 1$ vectors. The partially-pivoted ACA algorithm (ACA for short), selecting $j_{k},i_{k}$ by only looking at previously selected rows and columns, is described as Algorithm 1. Specifically, each iteration $k$ selects pivot $i_{k}$ used in the current iteration and pivot $j_{k+1}$ for the next iteration (via line 1 and 1) as

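$i_{k}=\operatorname*{arg\,max}_{i\neq i_{1},...,i_{k-1}}\lvert E_{k-1}(:,j_{k})\rvert$ (2)

$j_{k+1}=\operatorname*{arg\,max}_{j\neq j_{1},...,j_{k}}\lvert E_{k-1}(i_{k},:)\rvert$ (3)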

and $j_{1}$ is a random initial column index. Note that $i_{k}\neq i_{1},...,i_{k-1}$ and $j_{k}\neq j_{1},...,j_{k-1}$ are enforced. The iteration is terminated when $\nu<\epsilon\mu$ with

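$\nu=\left\lVert u_{k}v_{k}^{t}\right\rVert_{F}\approx\left\lVert A-UV\right\rVert_{F},\quad\mu=\left\lVert UV\right\rVert_{F}\approx\left\lVert A\right\rVert_{F}$ (4)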

and $\epsilon$ is the prescribed tolerance. Note that each iteration requires only $O(nr_{k})$ flop operations with $r_{k}$ denoting currently revealed numerical rank. The overall complexity of partially-pivoted ACA scales as $O(nr^{2})$ when the algorithm converges in $O(r)$ iterations. Despite the favorable complexity, the convergence of ACA for general rank-deficient matrices is unsatisfactory. For many rank-deficient matrices arising from the numerical solution of PDEs, signal processing and data science, ACA oftentimes either requires $O(n)$ iterations or exhibits premature termination. First, as ACA does not search the full residual matrices for the largest element, it cannot avoid selection of smaller pivots for general rank-deficient matrices and may require $O(n)$ iterations. Second, the approximation $\left\lVert u_{k}v_{k}^{t}\right\rVert_{F}$ in (4) often causes the premature termination with the selection of smaller pivots. Remedies such as averaged stopping criteria Zhou et al. (2017), stochastic error estimation Heldring et al. (2015), ACA+ Grasedyck and Hackbusch (2005), and hybrid ACA Grasedyck and Hackbusch (2005) have been developed but they do not generalize to a broad range of applications.
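A minimal sketch of the partially-pivoted ACA iteration described above, written in NumPy. It is not the authors' Algorithm 1: the callbacks `get_row` and `get_col`, the fixed starting column `j0` (the text uses a random one), and the `max_rank` cap are assumptions made for illustration. The norm $\mu$ is accumulated incrementally so that each iteration costs $O(nr_{k})$.

```python
import numpy as np


def aca(get_row, get_col, m, n, eps=1e-8, max_rank=None, j0=0):
    """Partially pivoted ACA sketch; get_row(i) / get_col(j) return a row / column of A."""
    max_rank = min(m, n) if max_rank is None else max_rank
    U = np.zeros((m, 0))
    V = np.zeros((0, n))
    used_rows = np.zeros(m, dtype=bool)
    used_cols = np.zeros(n, dtype=bool)
    j, mu2 = j0, 0.0
    for _ in range(max_rank):
        used_cols[j] = True
        col = get_col(j) - U @ V[:, j]          # residual column E_{k-1}(:, j_k)
        # Row pivot: largest residual entry among rows not selected before.
        i = int(np.argmax(np.where(used_rows, -1.0, np.abs(col))))
        if col[i] == 0.0:                        # residual column vanished
            break
        used_rows[i] = True
        row = get_row(i) - U[i, :] @ V          # residual row E_{k-1}(i_k, :)
        u, v = col / col[i], row                 # rank-1 update u v^t
        nu2 = (u @ u) * (v @ v)                  # ||u v^t||_F^2
        # ||UV + u v^t||_F^2 = mu^2 + nu^2 + 2 <UV, u v^t>
        mu2 = max(mu2 + nu2 + 2.0 * np.sum((U.T @ u) * (V @ v)), 0.0)
        U = np.column_stack([U, u])
        V = np.vstack([V, v])
        if np.sqrt(nu2) < eps * np.sqrt(mu2) or used_cols.all():
            break
        # Column pivot for the next iteration, excluding used columns.
        j = int(np.argmax(np.where(used_cols, -1.0, np.abs(row))))
    return U, V
```

For a smooth, well-separated kernel such as $1/\lvert x-y\rvert$ this sketch is expected to terminate after a number of iterations close to the numerical rank; for the non-smooth or coherent matrices discussed above it can stop too early or run for many iterations.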

3.2 Blocked Adaptive Cross Approximation

Instead of selecting only one column and row from the residual matrix in each ACA iteration, we can select a fixed-size block of columns and rows per iteration to improve the convergence and accuracy of ACA. In addition, many BLAS-1 and BLAS-2 operations of ACA become BLAS-3 operations and hence higher flop performance can be achieved.

Specifically, the proposed BACA algorithm factorizes $A$

$A\approx UV=\sum_{k=1}^{n_{d}}U_{k}V_{k}$ (5)

where $U_{k}\in\mathbb{R}^{m\times d_{k}}$ and $V_{k}\in\mathbb{R}^{d_{k}\times n}$ . In principle, the algorithm selects a block of $d$ rows and columns via cross approximations in the residual matrix and then $d_{k}\leq d$ ones via rank-revealing algorithms to form a low-rank update at iteration $k$ . The total number of iterations is approximately $n_{d}\approx\lceil r/d\rceil$ if $d_{k}\approx d$ . Instead of selecting row/column pivots via lines 1 and 1 of Algorithm 1, the proposed algorithm selects row and column index sets $I_{k}$ and $J_{k}$ by performing QRCP on $d$ columns (more precisely their transpose) and rows of the residual matrices. This proposed strategy is described in Algorithm 2.

Each BACA iteration is composed of three steps.

• Find block row $I_{k}$ and block column $J_{k+1}$ by QRCP. Starting with a random column index set $J_{1}$, the block row $I_{k}$ and the next iteration’s block column $J_{k+1}$ are selected by (line 2 and 2)

[TABLE]

Here the algorithm first selects $d$ skeleton rows from the submatrix $E_{k-1}(:,J_{k})$ (i.e., $d$ columns from its transpose) and then selects $d$ skeleton columns from the submatrix $E_{k-1}(I_{k},:)$. Both selections leverage the LAPACK implementation of QRCP, which provides a simple way of greedily selecting well-conditioned columns by examining column norms in the $R$ factor at each iteration. Note that many other subset selection algorithms exist in both the machine learning and numerical linear algebra communities (e.g., strong rank-revealing QR Gu and Eisenstat (1996), spectrum-revealing QR Feng et al. (2018b), and column subset selection problems Boutsidis et al. (2009)), which ideally pick $d$ matrix columns with maximum volumes. Note that $I_{k}$ excludes rows selected in previous iterations. To efficiently enforce this condition, the QRCP is performed on the submatrix of $E_{k-1}^{t}(:,J_{k})$ excluding previously selected rows rather than directly on $E_{k-1}^{t}(:,J_{k})$. Similarly, $J_{k}$ excludes columns selected in previous iterations. See Fig. 1(a) for an illustration of the procedure. $I_{k}$ and $J_{k+1}$ are selected by QRCP on the column and transpose of the row marked in yellow, respectively. The column marked in grey is used to select $I_{k+1}$ in the next iteration. For illustration purposes, index sets in Fig. 1(a) consist of contiguous indices.

• Form the factors of the low-rank product $U_{k}V_{k}$. Let $C_{k}=E_{k-1}(:,J_{k})$, $R_{k}=E_{k-1}(I_{k},:)$ and $W_{k}=E_{k-1}(I_{k},J_{k})$. Then $E_{k-1}$ can be approximated by an ID-type decomposition $E_{k-1}\approx C_{k}W_{k}^{\dagger}R_{k}=U_{k}V_{k}$ Voronin and Martinsson (2017) by (8) and (9). Note that the pseudo-inverse is computed via rank-revealing QR (also see the LRID algorithm at line 2). The rank-revealing algorithm is needed as the $d\times d$ block $W_{k}$ can be further compressed with rank $d_{k}$. Particularly for matrices where the ACA algorithm tends to fail, the corresponding $d\times d$ matrices $W_{k}$ in BACA are often rank-deficient. In this case, BACA becomes more robust than ACA as the effective $d_{k}$ pivots can still be used to generate $d$ columns $J_{k+1}$ for the next iteration (as long as $d_{k}>0$). Consequently, the effective rank increase is $d_{k}\leq d$ and the pivot pair $(I_{k},J_{k})$ is updated in (10) by the column pivots $\bar{J}$ of QRCP in (8).

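$[Q,T,\bar{J}]=\mathtt{QR}(W_{k},\epsilon)$ with $Q\in\mathbb{R}^{d\times d_{k}}$ (8)

$U_{k}=C_{k}(:,\bar{J}),\quad V_{k}=T^{-1}Q^{t}R_{k}$ (9)

$I_{k}\leftarrow I_{k}([1,d_{k}]),\quad J_{k}\leftarrow J_{k}(\bar{J})$ (10)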

• Compute $\nu=\left\lVert U_{k}V_{k}\right\rVert_{F}$ and update $\mu=\left\lVert UV\right\rVert_{F}$. Assuming constant block size $d$, the norm of the low-rank update can be computed in $O(nd_{k}^{2})$ operations (line 2) via

[TABLE]

Once $\nu$ is computed, the norm of $UV$ can be updated efficiently in $O(nr_{k}d_{k})$ operations (line 2) as

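$\mu^{2}\leftarrow\mu^{2}+\nu^{2}+2\sum_{i=1}^{r_{k-1}}\sum_{j=1}^{d_{k}}\tilde{V}(i,j)$ (12)

$\tilde{V}=(VV_{k}^{t})\circ(U^{t}U_{k})$ (13)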

where $r_{k}$ represents the column dimension of $U$ at iteration $k$ . Note that the matrix multiplications in (11) and (13) involving $V_{k}$ and $V$ (and similarly for those involving $U_{k}$ and $U$ ) can be performed as $[V,V_{k}]V_{k}^{t}$ to further improve the computational efficiency. Then the algorithm updates $U$ , $V$ as $[U,U_{k}]$ , $[V;V_{k}]$ and tests the stopping criteria $\nu<\epsilon\mu$ . Note that $\nu,\mu$ with larger $d$ provides better approximations to the exact stop criteria compared to those in (4) hence can significantly reduce the chance of premature termination.
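The three steps above can be put together in a compact NumPy/SciPy sketch. It is an illustration under stated assumptions rather than the authors' Algorithm 2: the `entry(I, J)` callback, the `seed` and `max_iter` parameters, and the rule for truncating the QRCP of $W_{k}$ (diagonal of $T$ compared with its first entry) are hypothetical choices, and the early exit on a negligible sampled residual is an extra safeguard that is not part of the description in the text.

```python
import numpy as np
from scipy.linalg import qr, solve_triangular


def baca(entry, m, n, d=8, eps=1e-8, max_iter=200, seed=0):
    """BACA sketch: entry(I, J) returns the dense block A[I, J] for index arrays I, J."""
    rng = np.random.default_rng(seed)
    rows, cols = np.arange(m), np.arange(n)
    U, V = np.zeros((m, 0)), np.zeros((0, n))
    used_rows = np.zeros(m, dtype=bool)
    used_cols = np.zeros(n, dtype=bool)
    J = rng.choice(n, size=min(d, n), replace=False)   # random initial block column
    mu2 = 0.0
    for _ in range(max_iter):
        C = entry(rows, J) - U @ V[:, J]                # C_k = E_{k-1}(:, J_k)
        # Extra safeguard (assumption, not in the text): negligible sampled residual.
        if mu2 > 0.0 and np.linalg.norm(C) <= eps * np.sqrt(mu2):
            break
        free = np.flatnonzero(~used_rows)
        if free.size == 0:
            break
        # Step 1a: QRCP on the transposed block column selects the block row I_k.
        piv = qr(C[free, :].T, mode="economic", pivoting=True)[2]
        I = free[piv[: min(J.size, free.size)]]
        R = entry(I, cols) - U[I, :] @ V                # R_k = E_{k-1}(I_k, :)
        # Step 2: LRID via truncated QRCP of W_k = E_{k-1}(I_k, J_k).
        Q, T, Jbar = qr(C[I, :], mode="economic", pivoting=True)
        diag = np.abs(np.diag(T))
        dk = 0 if diag[0] == 0.0 else int(np.sum(diag > eps * diag[0]))
        if dk == 0:
            break
        Uk = C[:, Jbar[:dk]]                            # U_k = C_k(:, Jbar)
        Vk = solve_triangular(T[:dk, :dk], Q[:, :dk].T @ R)   # V_k = T^{-1} Q^t R_k
        # Step 3: norms from small Gram matrices instead of forming U_k V_k.
        nu2 = max(np.sum((Uk.T @ Uk) * (Vk @ Vk.T)), 0.0)
        mu2 = max(mu2 + nu2 + 2.0 * np.sum((U.T @ Uk) * (V @ Vk.T)), 0.0)
        U, V = np.hstack([U, Uk]), np.vstack([V, Vk])
        used_rows[I[:dk]] = True
        used_cols[J[Jbar[:dk]]] = True
        if np.sqrt(nu2) < eps * np.sqrt(mu2):           # stopping test nu < eps * mu
            break
        # Step 1b: QRCP on the block row selects the next block column J_{k+1}.
        mask = ~used_cols
        mask[J] = False
        cand = np.flatnonzero(mask)
        if cand.size == 0:
            break
        piv = qr(R[:, cand], mode="economic", pivoting=True)[2]
        J = cand[piv[: min(d, cand.size)]]
    return U, V
```

A possible call for a dense test matrix `A` is `baca(lambda I, J: A[np.ix_(I, J)], *A.shape, d=5)`. In a real application `entry` would evaluate only the requested entries, which is where the reduced complexity comes from.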

We would like to highlight the difference between the proposed BACA algorithm and existing ACA algorithms. First, as BACA selects a block of rows and columns per iteration as opposed to a single row and column in the baseline ACA algorithm, the convergence behavior and flop performance can be significantly improved. In the existing ACA algorithms, convergence can also be improved by leveraging averaged stopping criteria Zhou et al. (2017) or searching a single pivot in a broader range of rows and columns (e.g., fully-pivoted ACA). However, they still find one row or column at a time in each iteration and hence suffer from poor flop performance. Moreover, they cannot utilize strong rank revealing algorithms to select skeleton rows and columns with better volume (determinant in modulus) qualities. Second, BACA also has important connections to the hybrid ACA algorithm Grasedyck and Hackbusch (2005). The hybrid ACA algorithm assumes prior knowledge about the skeleton rows and columns to leverage interpolation algorithms (e.g., ID and CUR) on a skeleton submatrix and use ACA to refine the skeletons. In contrast, BACA uses cross approximations with QRCP to select skeleton rows and columns and uses interpolation algorithms (LRID at line 2) to form the low-rank update in each iteration. In other words, hybrid ACA can be treated as embedding ACA into interpolation algorithms while BACA can be thought of as embedding interpolation algorithms into ACA iterations. In addition, BACA is purely algebraic and requires no prior knowledge of the row/column skeletons or geometrical information about the rows/columns.

It is worth mentioning that the choice of $d$ affects the trade-off between efficiency and robustness of the BACA algorithm. When $d<r$, the algorithm requires $O(nr^{2})$ operations assuming convergence in $O(r/d)$ iterations as each iteration requires $O(nr_{k}d)$ operations. For example, BACA (Algorithm 2) precisely reduces to ACA (Algorithm 1) when $d=1$. In what follows we refer to the baseline ACA algorithm as BACA with $d=1$. On the other hand, BACA converges in a constant number of iterations when $d\gg r$. In the extreme case, BACA reduces to QRCP-based ID when $d=\mathrm{min}\{m,n\}$ (note that the LRID algorithm at line 2 remains the only nontrivial operation). In this case the algorithm requires $O(n^{2}r)$ operations but enjoys the provable convergence of QRCP. Detailed complexity analysis of the BACA algorithm will be provided in Section 4 (Cost Analysis).

The BACA algorithm oftentimes exhibits overestimated ranks compared to those revealed by truncated SVD. Therefore, an SVD re-compression step of $U$ and $V$ may be needed via first computing a QR of $U$ and $V$ as $[Q_{U},T_{U}]=\mathtt{QR}(U)$ , $[Q_{V},T_{V}]=\mathtt{QR}(V^{t})$ , and then a truncated SVD of $T_{U}T_{V}^{t}$ Heldring et al. (2015). The result can be viewed as an approximate truncated SVD of $A$ and we assume this is the output of the BACA algorithm in the rest of this paper.
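The re-compression step just described (QR of both factors followed by a truncated SVD of the small product $T_{U}T_{V}^{t}$) can be sketched as follows. The function name `recompress` is hypothetical, and the truncation uses the $\epsilon$-rank definition from Section 2.

```python
import numpy as np


def recompress(U, V, eps):
    """Turn A ~= U @ V into an approximate truncated SVD U2 @ diag(s) @ V2."""
    Qu, Tu = np.linalg.qr(U)               # U   = Qu Tu
    Qv, Tv = np.linalg.qr(V.T)             # V^t = Qv Tv
    W, s, Z = np.linalg.svd(Tu @ Tv.T)     # SVD of the small core matrix
    r = 0 if s.size == 0 or s[0] == 0.0 else int(np.sum(s >= eps * s[0]))
    # U V = Qu (Tu Tv^t) Qv^t = (Qu W) diag(s) (Z Qv^t)
    return Qu @ W[:, :r], s[:r], Z[:r, :] @ Qv.T
```

The cost is dominated by the two QR factorizations, so the step stays linear in the matrix dimension for a fixed number of columns in `U`.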

3.3 Parallel Hierarchical Low-Rank Merge

The distributed-memory implementations of the proposed BACA algorithm and the baseline ACA algorithm can pose performance challenges as straightforward parallelization of all operations in Algorithm 2 and 1 involves many collective communications. To see this, assuming the $U$ and $V$ factors in Algorithm 1 follow 1D block row and column data layouts, then every operation from line 3 to line 9 requires one or more collective communications. Instead, one can assign one process to perform BACA/ACA on submatrices without any communication and then leverage parallel low-rank arithmetic to merge the results into one single low-rank product. To elucidate the proposed algorithm, we first describe the hierarchical low-rank merge algorithm then outline its parallel implementation.

Given a matrix $A\in\mathbb{R}^{m\times n}$ with $m\approx n$, the algorithm first creates $L$-level binary trees for index vectors $[1,m]$ and $[1,n]$ with index sets $I_{\tau}$ and $J_{\nu}$ for nodes $\tau$ and $\nu$ at each level, upon recursively dividing each index set into $I_{\tau_{i}}$ / $J_{\nu_{j}}$ of approximately equal sizes, $i=1,2$, $j=1,2$. Here, $\tau_{i}$ and $\nu_{j}$ are children of $\tau$ and $\nu$, respectively. The leaf and root levels are denoted $0$ and $L$, respectively. This process generates $n_{b}$ leaf-level submatrices of similar sizes. For simplicity, it is assumed $n_{b}=4^{L}$. We denote submatrices associated with $\tau,\nu$ as $A_{\tau\nu}=A(I_{\tau},J_{\nu})$ and their truncated SVD as $[U_{\tau\nu},\Sigma_{\tau\nu},V_{\tau\nu},r_{\tau\nu}]=\mathtt{SVD}(A_{\tau\nu},\epsilon)$. Here $r_{\tau\nu}$ is the $\epsilon$-rank of $A_{\tau\nu}$. As submatrices $A_{\tau\nu}$ have significantly smaller dimensions than $A$ (e.g., when $n_{b}=O(n^{2})$ as an extreme case), both BACA and ACA algorithms become more robust to attain the truncated SVD. Following compression of $n_{b}$ submatrices $A_{\tau\nu}$ by BACA or ACA at step $l=0$, there are multiple approaches to combine them into one low-rank product including randomized algorithms via applying $A$ to random matrices, and deterministic algorithms via recursively pair-wise re-compressing the blocks using low-rank arithmetic. Here we choose the deterministic algorithm for simplicity of rank estimation and parallelization, and we deploy truncated SVD as the re-compression tool, although other tools such as ID, QR, UTV can also be applied. Fig. 1(b) illustrates one re-compression operation for transforming SVDs of $A_{\tau_{i}\nu_{j}},i=1,2,j=1,2$ into that of $A_{\tau\nu}$. The operation first horizontally compresses SVDs of $A_{\tau_{i}\nu_{j}},i=1,2,j=1,2$ at step $l-\frac{1}{2}$ and then vertically compresses the results, i.e., SVDs of $A_{\tau_{i}\nu},i=1,2$ at step $l$, $l=1,...,L$. Specifically, the horizontal compression step is composed of one concatenation operation in (14) and one compression operation in (15):

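$\bar{U}_{\tau_{i}\nu}=[U_{\tau_{i}\nu_{1}}\Sigma_{\tau_{i}\nu_{1}},U_{\tau_{i}\nu_{2}}\Sigma_{\tau_{i}\nu_{2}}],\quad\bar{V}_{\tau_{i}\nu}=\mathrm{diag}(V_{\tau_{i}\nu_{1}},V_{\tau_{i}\nu_{2}})$ (14)

$[U_{\tau_{i}\nu},\Sigma_{\tau_{i}\nu},V_{\tau_{i}\nu},r_{\tau_{i}\nu}]\leftarrow\mathtt{SVD}(\bar{U}_{\tau_{i}\nu},\epsilon),\quad V_{\tau_{i}\nu}\leftarrow V_{\tau_{i}\nu}\bar{V}_{\tau_{i}\nu}$ (15)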

with $i=1,2$ . Let $\bar{U}_{\tau_{i}\nu}\bar{V}_{\tau_{i}\nu}$ and $U_{\tau_{i}\nu}\Sigma_{\tau_{i}\nu}V_{\tau_{i}\nu}$ denote the submatrix before and after the SVD truncation, respectively. Similarly, the vertical compression step can be performed via horizontal merge of $A_{\tau_{i}\nu}^{t},i=1,2$ . Let $s_{l}$ represent the maximum rank $r_{\tau\nu}$ among all blocks at steps $l=0,1,...,L$ . Note that the algorithm returns an approximate truncated SVD after $L$ steps. As an example, the hierarchical merge algorithm with the level count of the hierarchical merge $L=2$ and $n_{b}=16$ is illustrated in Fig. 2. At step $l=0$ , the algorithm compresses all $n_{b}$ submatrices with BACA; at step $l=0.5,1.5$ , the algorithm merges every horizontal pair of blocks; similarly at level $l=1,2$ , the algorithm merges every vertical pair of blocks. Note that blocks surrounded by solid lines represent results after compression at each step $l$ .
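One serial re-compression operation (a horizontal merge of each block row followed by a vertical merge) can be sketched as below. This is an illustration only: the helper names `tsvd`, `merge_horizontal`, `merge_vertical` and `merge_2x2` are hypothetical, each block is represented as a tuple `(U, s, V)` with `s` the vector of singular values, and nothing here reflects the distributed-memory layout.

```python
import numpy as np
from scipy.linalg import block_diag


def tsvd(A, eps):
    """Truncated SVD returning (U, s, V) with A ~= (U * s) @ V."""
    U, s, V = np.linalg.svd(A, full_matrices=False)
    r = 0 if s.size == 0 or s[0] == 0.0 else int(np.sum(s >= eps * s[0]))
    return U[:, :r], s[:r], V[:r, :]


def merge_horizontal(left, right, eps):
    """Merge the SVDs of two blocks [A1, A2] that share the same rows."""
    (U1, s1, V1), (U2, s2, V2) = left, right
    Ubar = np.hstack([U1 * s1, U2 * s2])     # concatenation: [U1 S1, U2 S2]
    Vbar = block_diag(V1, V2)                # diag(V1, V2)
    U, s, W = tsvd(Ubar, eps)                # compression by truncated SVD
    return U, s, W @ Vbar                    # V <- V Vbar


def merge_vertical(top, bottom, eps):
    """Merge [A1; A2] by horizontally merging the transposed blocks."""
    def transpose(f):
        return f[2].T, f[1], f[0].T
    return transpose(merge_horizontal(transpose(top), transpose(bottom), eps))


def merge_2x2(blocks, eps):
    """blocks = ((A11, A12), (A21, A22)), each given as (U, s, V)."""
    (a11, a12), (a21, a22) = blocks
    return merge_vertical(merge_horizontal(a11, a12, eps),
                          merge_horizontal(a21, a22, eps), eps)
```

Applying `merge_2x2` recursively from the leaves upward gives the $L$-step hierarchical merge; in this sketch the leaf SVDs could come from any compressor, including a BACA routine followed by re-compression.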

The above-described hierarchical algorithm with BACA for leaf-level compressions, is dubbed H-BACA (Algorithm 3). In the following, a distributed-memory implementation of the H-BACA algorithm is described. Without loss of generality, it is assumed that $m=n=2^{i}$ and $p=2^{j}$ . The proposed parallel implementation first creates two $\lceil\mathrm{log}\sqrt{p}\rceil$ -level binary trees with $p$ denoting the total number of MPI processes. One process performs BACA compression of one or two leaf-level submatrices and low-rank merge operations from the bottom up until it reaches a submatrix shared by more than one process. Then, all such blocks are handled by PBLAS and ScaLAPACK with BLACS process grids that aggregate those in corresponding submatrices. Consider the example in Fig. 2 with process count $p=8$ . The workload of each process is labeled with its process rank and highlighted with one color. The dashed lines represent the ScaLAPACK blocks. First, BACA compressions and merge operations at $l=0,0.5$ are handled locally by one process without any communication. Next, merge operations at $l=1,1.5,2$ are handled by BLACS grids of $2\times 1$ , $2\times 2$ , and $4\times 2$ , respectively. For illustration purposes, we select the ScaLAPACK block size in Fig. 2 as $n_{0}\times n_{0}$ where $n_{0}$ is the dimension of the finest-level submatrices in the hierarchical merge algorithm and $n=\sqrt{n_{b}}n_{0}$ . In this case, the only required data redistribution is from step $l=1$ to $l=1.5$ . However, the ScaLAPACK block size may be set to much smaller numbers in practice, requiring data redistribution at each row/column re-compression step. Similarly, the requirement of $m=n=2^{i}$ and $p=2^{j}$ is not needed in practice.

4 Cost Analysis

In this section, the costs for computation and communication of the proposed BACA and H-BACA algorithms are analyzed.

4.1 Computational Cost

First, the costs for BACA can be summarized as follows. Assuming BACA converges in $O(\lceil{r}/{d}\rceil)$ iterations, each iteration performs entry evaluation from the residual matrices, QRCP for pivot selection, LRID for forming the LR product, and estimation of matrix norms. The entry evaluation computes $O(nd)$ entries each requiring $O(r_{k})$ operations; QRCP on block rows requires $O(nd^{2})$ operations; the LRID algorithm requires $O(ndd_{k}+d_{k}d^{2})$ operations; norm estimation requires $O(nr_{k}d_{k})$ operations. Summing up these costs, the overall cost for the BACA algorithm is

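$$c_{BACA}=\sum_{k=1}^{\lceil r/d\rceil}O\left(ndr_{k}+nd^{2}+ndd_{k}+d_{k}d^{2}+nr_{k}d_{k}\right)=O(nr^{2}).$$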

Here we assume the block size $d\leq r$ . Note that when $d\gg r$ (e.g., $d=O(n)$ ), it follows that the worst-case complexity is $c_{BACA}=O(n^{2}r)$ by bypassing the pivot selection step that causes the $nd^{2}$ term. In practice, one would always avoid the case of $d\gg r$ .
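The $O(nr^{2})$ scaling for $d\leq r$ can be checked by tallying the per-iteration terms listed above. The following Python sketch is a rough operation-count model only: it assumes that each iteration reveals exactly $d$ new columns ($d_{k}=d$, $r_{k}=\min(kd,r)$) and that all constants are one; the function name `baca_cost_model` is hypothetical.

```python
import math

def baca_cost_model(n, r, d):
    """Sum the per-iteration operation counts over ceil(r/d) iterations."""
    total = 0
    for k in range(1, math.ceil(r / d) + 1):
        r_k = min(k * d, r)  # assumed rank revealed up to iteration k
        d_k = d              # assumed rank increment of iteration k
        total += (n * d * r_k        # entry evaluation from the residual
                  + n * d ** 2       # QRCP on block rows/columns
                  + n * d * d_k      # LRID, first term
                  + d_k * d ** 2     # LRID, second term
                  + n * r_k * d_k)   # norm estimation
    return total

n, d = 10000, 16
for r in (256, 512, 1024):
    print(r, baca_cost_model(n, r, d) / (n * r ** 2))
```

Under these assumptions the ratio to $nr^{2}$ stays bounded as $r$ grows, so doubling $r$ roughly quadruples the modeled cost.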

Next, the computational costs of the H-BACA algorithm are analyzed. The costs are analyzed for two cases of distributions of the maximum ranks $s_{l}$ at each level, i.e., $s_{l}=r$ (ranks stay constant during the merge) and $s_{l}\approx 2^{l}r/\sqrt{n_{b}}=2^{l-L}r$ (rank increases by a factor of $2$ per level), $l=0,1,...,L$ . The constant-rank case is often valid for matrices with their numerical ranks independent of matrix dimensions (e.g., random low-rank matrices, matrices representing well-separated interactions from low-frequency and static wave equations and certain quantum chemistry matrices); the increasing-rank case holds true for matrices whose ranks depend polynomially (with order no bigger than 1) on the matrix dimensions (e.g., those arising from high-frequency wave equations, matrices representing near-field interactions from low-frequency and static wave equations, and certain classes of kernel methods on high dimensional data sets). From the aforementioned analysis of BACA, the computational costs for the leaf-level compression $c_{b}=c_{BACA}n_{b}$ are:

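$$c_{b}=\begin{cases}O(nr^{2}\sqrt{n_{b}}), & s_{l}=r\\ O(nr^{2}/\sqrt{n_{b}}), & s_{l}\approx 2^{l-L}r\end{cases}$$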

which represent the complexity with ACA when $n_{b}=1$ .

Let $n_{l}=2^{l}n/\sqrt{n_{b}}$ denote the size of submatrices $A_{\tau,\nu}$ at level $l$ . The computational costs $c_{m}$ of hierarchical merge operations can be estimated as

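$$c_{m}=\sum_{l=1}^{L}\frac{n_{b}}{4^{l}}O(n_{l}s_{l}^{2})=\begin{cases}O(nr^{2}\sqrt{n_{b}}), & s_{l}=r\\ O(nr^{2}), & s_{l}\approx 2^{l-L}r\end{cases}$$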

Accounting for the two cases of rank distributions, the computational costs for the leaf-level BACA and hierarchical merge operations of the H-BACA algorithm are summarized in Table 1. Note that the costs of the BACA algorithm can also be extracted from Table 1 upon setting $n_{b}=1$ . Not surprisingly, the hierarchical merge algorithm induces a computational overhead of at most $\sqrt{n_{b}}$ when ranks stay constant; the leaf-level compression can have a $1/\sqrt{n_{b}}$ reduction factor for the increasing rank case and $\sqrt{n_{b}}$ overhead for the constant rank case.
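The overhead and reduction factors quoted above can be reproduced with a small Python model. This sketch makes several assumptions that should be read as such: $n_{b}=4^{L}$, a leaf-level cost of $n_{0}s_{0}^{2}$ per block, a merge cost of $n_{l}s_{l}^{2}$ per level-$l$ block with $n_{b}/4^{l}$ blocks per level, and unit constants. The function name `hbaca_cost_model` is hypothetical.

```python
def hbaca_cost_model(n, r, L, increasing_rank):
    """Return (leaf cost, merge cost) of the assumed H-BACA operation-count model."""
    n_b = 4 ** L

    def s(l):
        # Maximum rank at level l: 2^(l-L) r (increasing) or r (constant)
        return r * 2.0 ** (l - L) if increasing_rank else float(r)

    def n_l(l):
        # Block size at level l: 2^l n / sqrt(n_b)
        return n * 2.0 ** (l - L)

    leaf = n_b * n_l(0) * s(0) ** 2
    merge = sum((n_b / 4 ** l) * n_l(l) * s(l) ** 2 for l in range(1, L + 1))
    return leaf, merge

n, r, L = 4096, 64, 3  # n_b = 64, sqrt(n_b) = 8
for increasing in (False, True):
    leaf, merge = hbaca_cost_model(n, r, L, increasing)
    print(increasing, leaf / (n * r ** 2), merge / (n * r ** 2))
```

In this model the constant-rank case yields leaf and merge costs that grow like $\sqrt{n_{b}}\,nr^{2}$, whereas the increasing-rank case yields a leaf cost of $nr^{2}/\sqrt{n_{b}}$ and a merge cost that stays within a constant factor of $nr^{2}$.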

For completeness, a comparison between the proposed BACA and H-BACA algorithms (assuming $d\leq r_{0}$ ) and existing ACA algorithms is given in Table 2. In contrast to existing ACA algorithms that select one pivot at a time, BACA and H-BACA select $d$ and $n_{b}d$ pivots simultaneously, respectively. As such, H-BACA is the most robust algorithm among all listed here. Not surprisingly, H-BACA can induce a computational overhead of $\sqrt{n_{b}}$ .

4.2 Communication Cost

As the leaf-level BACA compression requires no communication, only the communication costs for the hierarchical merge operations are analyzed here. Since the merge operations may introduce an $O(\sqrt{n_{b}})$ computational overhead, one would only increase $n_{b}$ to create more parallelism, i.e., the process count $p\approx n_{b}$ . Let $p_{l}=4^{l}$ denote the number of processes involved in one level $l$ merge operation, $l=1,...,L$ . The operation requires redistribution between process grids of sizes $p_{l}$ , $2p_{l}$ and $4p_{l}$ (see the example in Fig. 2). Each process grid involves a PDGEMM function in PBLAS to combine the low-rank products and a PDGESVD function in ScaLAPACK to compute the new rank after the combination (see Fig. 1(b)). Let the pair [#messages, volume] denote the communication cost including the number of messages and the number of words transferred along the critical path. Then the communication costs for each (BLACS) grid redistribution, PDGEMM and PDGESVD during the hierarchical merge are $[O(1),O(n_{l}s_{l}/p_{l})]$ , $[O(s_{l}),O(n_{l}s_{l}/\sqrt{p_{l}})]$ , and $[O(s_{l}\mathrm{log}p_{l}),O(n_{l}s_{l}\mathrm{log}p_{l}/\sqrt{p_{l}})]$ , respectively. Recall that $n_{l}=2^{l}n/\sqrt{p}$ and $s_{l}$ denote the size and rank of submatrices at level $l$ and note that $n_{l}\gg s_{l}$ . Therefore the communication cost $v_{m}$ of the hierarchical merge (and H-BACA) can be estimated as

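$$v_{m}=\sum_{l=1}^{L}\left[O(s_{l}\mathrm{log}p_{l}),\;O\left(\frac{n_{l}s_{l}\mathrm{log}p_{l}}{\sqrt{p_{l}}}\right)\right]$$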

Considering the two cases of rank distributions, i.e., $s_{l}=r$ and $s_{l}\approx 2^{l-L}r$ , the overall communication costs of H-BACA are $v_{m}=[O(r\mathrm{log}^{2}p),O(nr\mathrm{log}^{2}p/\sqrt{p})]$ and $v_{m}=[O(r\mathrm{log}p),O(nr\mathrm{log}p/\sqrt{p})]$ , respectively (see Table 1).
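These two closed forms follow from summing the per-level PDGESVD terms, which dominate the redistribution and PDGEMM terms. As a worked sketch, assume $p=4^{L}$ , so that $\mathrm{log}p_{l}=2l\,\mathrm{log}2$ and $n_{l}/\sqrt{p_{l}}=n/\sqrt{p}$ , and ignore constants:

```latex
% Constant rank, s_l = r
\sum_{l=1}^{L} s_l \log p_l
  = 2 r \log 2 \sum_{l=1}^{L} l
  = O(r L^{2}) = O(r \log^{2} p),
\qquad
\sum_{l=1}^{L} \frac{n_l s_l \log p_l}{\sqrt{p_l}}
  = \frac{n r}{\sqrt{p}} \, O(L^{2})
  = O\!\left(\frac{n r \log^{2} p}{\sqrt{p}}\right).

% Increasing rank, s_l \approx 2^{l-L} r
\sum_{l=1}^{L} 2^{l-L} r \cdot 2 l \log 2
  \le 2 r L \log 2 \sum_{l=1}^{L} 2^{l-L}
  \le 4 r L \log 2 = O(r \log p),
\qquad
\sum_{l=1}^{L} \frac{n_l s_l \log p_l}{\sqrt{p_l}}
  = \frac{n}{\sqrt{p}} \, O(r \log p)
  = O\!\left(\frac{n r \log p}{\sqrt{p}}\right).
```

In the increasing-rank case the geometric weights $2^{l-L}$ concentrate the cost at the top levels, which removes one factor of $\mathrm{log}p$ .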

5 Numerical Results

This section presents several numerical results to demonstrate the accuracy and efficiency of the proposed H-BACA algorithm. The matrices in all numerical examples are generated from the following kernels:
1. Gaussian kernel: $A_{i,j}=\exp(\frac{-\left\lVert x_{i}-x_{j}\right\rVert^{2}}{2h^{2}})$ , $i,j=1,...,2n$ . Here $h$ is the Gaussian width, and $x_{i}\in\mathbb{R}^{8\times 1}$ and $\mathbb{R}^{784\times 1}$ are feature vectors in one subset of the SUSY and MNIST Data Sets from the UCI Machine Learning Repository Dheeru and Karra Taniskidou (2017), respectively. Note that the Gaussian kernel permits low-rank compression as shown in Wang et al. (2017); Bach (2013); Musco and Musco (2017).
2. EFIE2D kernel: $A_{i,j}=H_{0}^{(2)}(k\left\lVert x_{i}-x_{j}\right\rVert)$ resulting from the Nyström discretization of the electric field integral equation (EFIE) for electromagnetic scattering from 2-D curves. Here $H_{0}^{(2)}$ is the second kind Hankel function of order 0, $k$ is the free-space wavenumber, $x_{i},x_{j}\in\mathbb{R}^{2\times 1}$ are discretization points (15 points per wavelength) of two 2-D parallel strips of length $1$ and distance $1$ .
3. EFIE3D kernel: $A$ is obtained by the Galerkin method for EFIE to analyze electromagnetic scattering from 3-D surfaces.
4. Frontal3D kernel: $A$ is a dense frontal matrix that arises from the multifrontal sparse elimination for the finite-difference frequency-domain solution of the homogeneous-coefficient Helmholtz equation inside a unit cube.
5. Polynomial kernel: $A_{i,j}=(x_{i}^{t}x_{j}+h)^{2}$ . Here $x_{i},x_{j}\in\mathbb{R}^{50\times 1}$ are points from a randomly generated dataset, and $h$ is a regularization parameter.
6. Product-of-random kernel: $A=UV$ with $U\in\mathbb{R}^{n\times r}$ and $V\in\mathbb{R}^{r\times n}$ being random matrices with i.i.d. entries.
Note that the EFIE2D, EFIE3D and Frontal3D kernels result in complex-valued matrices.
Throughout this section, we refer to ACA as a special case of BACA when $d=1$ . In all examples except for the Product-of-random kernel, the algorithm is applied to the offdiagonal submatrix $A_{12}=A(1:n,1+n:2n)$ assuming rows/columns of $A$ have been properly permuted (e.g., by a KD-tree partitioning scheme). Note that the permutation may yield a hierarchical matrix representation of $A$ , but in this paper we only focus on compression of one off-diagonal subblock of $A$ with H-BACA. All experiments are performed on the Cori Haswell machine at NERSC, which is a Cray XC40 system and consists of 2388 dual-socket nodes with Intel Xeon E5-2698v3 processors running 16 cores per socket. The nodes are configured with 128 GB of DDR4 memory at 2133 MHz.
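For readers who want to reproduce the setup on a small scale, the off-diagonal Gaussian block and its $\epsilon$-rank can be formed as in the following NumPy sketch. Several aspects are assumptions made here: the points are uniformly random in $[0,1]^{8}$ rather than SUSY feature vectors, no KD-tree permutation is applied, and the $\epsilon$-rank is taken to be the number of singular values exceeding $\epsilon$ times the largest one. The function names are hypothetical.

```python
import numpy as np

def gaussian_offdiag_block(X, h):
    """Form A_12 = A(1:n, n+1:2n) of the Gaussian kernel matrix on the 2n rows of X."""
    n = X.shape[0] // 2
    X1, X2 = X[:n], X[n:2 * n]
    # Pairwise squared distances between the first and second halves of the points
    d2 = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-d2 / (2.0 * h ** 2))

def eps_rank(A, eps):
    """Number of singular values larger than eps times the largest one (assumed definition)."""
    s = np.linalg.svd(A, compute_uv=False)
    return int(np.sum(s > eps * s[0]))

rng = np.random.default_rng(0)
X = rng.random((400, 8))          # 2n = 400 stand-in points in 8 dimensions
A12 = gaussian_offdiag_block(X, h=1.0)
for eps in (1e-2, 1e-6):
    print(eps, eps_rank(A12, eps))
```

The dense SVD used here is only practical for small $n$; it serves as the reference against which a compression such as H-BACA would be compared.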

5.1 Convergence

First, the convergence of the proposed BACA algorithm is investigated using several matrices: Gaussian-SUSY matrices with $n=5000$ , $h=1.0,0.2$ , an EFIE3D matrix for a unit sphere with $n=21788$ and approximately 20 points per wavelength, and a Frontal3D matrix with $n=1250$ and 10 points per wavelength. The corresponding $\epsilon$ -ranks are $r=4683,1723,1488,718$ for $\epsilon=10^{-6}$ . The residual histories versus revealed ranks $r_{k}$ , at each iteration $k$ of BACA with $1\leq d\leq 256$ are plotted in Fig. 3. The residual error is defined as $\left\lVert U_{k}V_{k}\right\rVert_{F}/\left\lVert UV\right\rVert_{F}$ from (12). As a reference, the singular value spectra $\Sigma(k,k)/\Sigma(1,1)$ computed from $[U,\Sigma,V,r]=\mathtt{SVD}(A,\epsilon)$ are also plotted.
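The residual error above is a ratio of Frobenius norms of low-rank products, which can be evaluated without forming the products explicitly. The sketch below uses the standard identity $\left\lVert UV\right\rVert_{F}^{2}=\mathrm{trace}\left((U^{*}U)(VV^{*})\right)$; it is an illustration of one way to compute such norms, not necessarily the norm estimation used in the paper, and the function name `lowrank_fro_norm` is hypothetical.

```python
import numpy as np

def lowrank_fro_norm(U, V):
    """Frobenius norm of U @ V using only the two r x r Gram matrices."""
    G = U.conj().T @ U   # Gram matrix of the columns of U
    H = V @ V.conj().T   # Gram matrix of the rows of V
    # trace(G @ H) = sum_ij G_ij * conj(H_ij), because H is Hermitian
    return np.sqrt(max(np.vdot(H, G).real, 0.0))

rng = np.random.default_rng(1)
U = rng.standard_normal((50, 6)) + 1j * rng.standard_normal((50, 6))
V = rng.standard_normal((6, 40)) + 1j * rng.standard_normal((6, 40))
print(lowrank_fro_norm(U, V), np.linalg.norm(U @ V))
```

For an $m\times r$ factor $U$ and an $r\times n$ factor $V$ this costs $O((m+n)r^{2})$ operations instead of the $O(mnr)$ needed to form the product.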

For the Gaussian-SUSY matrices, the baseline ACA algorithm ( $d=1$ ) behaves poorly with smaller $h$ due to the exponential decay of the Gaussian kernel. As a result, the matrix becomes increasingly sparse and coherent for small $h$ particularly for high dimensional data sets. In fact, ACA constantly selects smaller pivots and the residual exhibits wild oscillations particularly for smaller $h$ (e.g., when $h=0.2$ in Fig. 3(b)). Similarly, the analytical and numerical Green’s functions respectively for the EFIE3D (Fig. 3(c)) and Frontal3D (Fig. 3(d)) matrices are not asymptotically smooth enough for ACA to converge rapidly. For all examples in Fig. 3, significant portions of the residual curves lie below the singular value spectra, which causes premature iteration termination for certain given residual errors. In stark contrast, the proposed BACA algorithm ( $d=32,64,100,128,256$ ) shows increasingly smooth residual histories residing above the singular value spectra as the block size $d$ increases. Although BACA may overestimate the matrix ranks particularly for larger $d$ , the SVD re-compression step mentioned in the section on Blocked Adaptive Cross Approximation can effectively reduce the ranks.

5.2 Accuracy

Next, the accuracy of the H-BACA algorithm is demonstrated using the following matrices: two Gaussian-SUSY matrices with $n=5000$ , $h=1.0,0.2$ , one EFIE3D matrix for a unit sphere with $n=1707$ and approximately 20 points per wavelength, and a Frontal3D matrix with $n=1250$ and 10 points per wavelength. The relative Frobenius-norm error $\left\lVert A-UV\right\rVert_{F}/\left\lVert A\right\rVert_{F}$ is computed for varying numbers of leaf-level submatrices $n_{b}$ and block sizes $d$ . When $h=1.0$ for the Gaussian-SUSY matrix (Fig. 4a), the H-BACA algorithms achieve desired accuracies ( $\epsilon=10^{-2},10^{-6},10^{-10}$ ) using the baseline ACA ( $d=1$ ), and BACA ( $d=32$ ) when $n_{b}=1$ and the hierarchical merge operation only causes slight error increases as $n_{b}$ increases. However, when $h=0.2$ for the Gaussian-SUSY matrix (Fig. 4b), all data points for H-BACA with $d=1$ fail due to the wildly oscillating residual histories. In contrast, H-BACA with $d=32$ achieves significantly better accuracies for most data points particularly as $n_{b}$ increases. For the EFIE3D (Fig. 4c) and Frontal3D (Fig. 4d) matrices, H-BACA with $d=32$ achieves accuracies comparable to those of H-BACA with $d=1$ for most data points. Note that $d=32$ is significantly better than $d=1$ when the prescribed residual error is large ( $\epsilon=10^{-2}$ ). This agrees with the residual histories in Fig. 3(c) and Fig. 3(d) as they lie below the singular value spectra when iteration count $k$ is small.

5.3 Efficiency

This subsection provides six examples to verify the computational complexity estimates in Table 1. H-BACA with leaf-level ACA ( $d=1$ ) and BACA ( $d=8,16,32,64,128$ ) is tested for the following matrices: one Gaussian-SUSY matrix with $n=50000$ , $h=1.0$ , $\epsilon=10^{-2}$ , one Gaussian-MNIST matrix with $n=5000$ , $h=3.0$ , $\epsilon=10^{-2}$ , one EFIE3D matrix for a unit sphere with $n=26268$ , $\epsilon=10^{-6}$ and $20$ points per wavelength, one Frontal3D matrix with $n=1250$ , $\epsilon=10^{-6}$ and 10 points per wavelength, one Polynomial matrix with $n=10000$ , $h=0.2$ , $\epsilon=10^{-4}$ , and one Product-of-random matrix with $n=2500$ , $\epsilon=10^{-4}$ . The corresponding $\epsilon$ -ranks are 298, 137, 1488, 788, 450 and 1000, respectively. It can be validated that the hierarchical merge operation attains increasing ranks for the Gaussian, EFIE3D and Frontal3D matrices, and relatively constant ranks for the Polynomial, and Product-of-random matrices. All examples use one process except that the Gaussian-SUSY example uses 16 processes. The CPU times are measured and plotted in Fig. 5.

Table 1 predicts that H-BACA exhibits increasing (with a factor of $\sqrt{n_{b}}$ ) and constant time when $s_{l}$ stays constant and increases, respectively. Note that the rank assumption $s_{l}\approx r$ leading to the $O(\sqrt{n_{b}})$ computational overhead may not be fully observed for practical values of $n_{b}$ and $n$ . Given one matrix, $s_{l}$ may stay approximately constant for a limited number of subdivision levels $l$ . For example, $s_{l}$ stays constant for bottom levels of EFIE3D and Frontal3D matrices, and top levels of Polynomial and Product-of-random matrices. This agrees with the observed scalings (w.r.t. $n_{b}$ ) in Fig. 5(c) - 5(f). As a reference, the $O(\sqrt{n_{b}})$ curves are plotted and only small ranges of $n_{b}$ exhibit the $O(\sqrt{n_{b}})$ overhead. For the Gaussian matrices, we even observe non-increasing CPU time w.r.t. $n_{b}$ when $n_{b}$ is not too big (see Fig. 5(a) and 5(b)).

The effects of varying block size $d$ also deserve further discussion. First, larger block size $d$ can significantly improve the robustness of H-BACA for the Gaussian matrices. For example, H-BACA does not achieve desired accuracies due to premature termination for all data points on the $d=1$ curve in Fig. 5(a) and $d=1,8$ curves in Fig. 5(b). In contrast, H-BACA with larger $d$ attains desired accuracies. Second, larger block size $d$ results in reduced CPU time for the Polynomial and Frontal3D matrices due to better BLAS performance (see Fig. 5(d) and 5(e)). For the other tested matrices, no significant performance differences have been observed by changing block size $d$ . However, for matrices with ranks $s_{0}\leq d$ , larger $d$ and $n_{b}$ can introduce significant overheads.

5.4 Parallel Performance

Finally, the parallel performance of the H-BACA algorithm is demonstrated via strong scaling studies with the EFIE2D, EFIE3D, Product-of-random and Gaussian matrices with process counts $p=8,...,1024$ . For the EFIE2D matrices, $n=160000$ and the wavenumbers are chosen such that the $\epsilon$ -ranks with $\epsilon=10^{-4}$ are $937$ and $107$ , respectively. For the EFIE3D matrices for a unit square, $n=21788$ and the wavenumbers are chosen such that the $\epsilon$ -ranks with $\epsilon=10^{-6}$ are $1007$ and $598$ , respectively. For the Product-of-random matrices, $n=10000$ and the inner dimension of the product is set to $r=2000$ and $800$ , respectively. For the Gaussian matrices with a randomly generated dataset of dimension $50$ and $n=10000$ , we choose $h=1.0$ and $h=1.6$ such that the $\epsilon$ -ranks with $\epsilon=10^{-3}$ are $2106$ and $191$ , respectively. In all examples, the block size and number of leaf-level subblocks in H-BACA are chosen as $d=8$ and $\sqrt{n_{b}}=\lceil\sqrt{p}\rceil$ . The ScaLAPACK block size is set to $64\times 64$ . As a reference, we compare to a straightforward parallel implementation of the baseline ACA algorithm, which essentially parallelizes every operation in ACA with collective MPI communications.

For all examples, the parallel ACA algorithm stops scaling when $p$ is sufficiently large (see Fig. 6). In contrast, the proposed parallel H-BACA algorithm scales up to $p=1024$ . In most examples, H-BACA achieves better parallel efficiencies with larger ranks due to better process utilization during the hierarchical merge operation. We also note that ACA outperforms H-BACA for the Product-of-random matrices with small process count $p$ (and $n_{b}$ ). This is partially attributed to the $O(\sqrt{n_{b}})$ overhead observed in Fig. 5(f).

Overall, the parallel H-BACA algorithm can achieve reasonably good parallel performance for rank-deficient matrices with modest to large numerical ranks. Not surprisingly, the parallel runtime is dominated by that of ScaLAPACK computation and possible redistributions between each re-compression step as analyzed in Section 4. Also note that the leaf-level BACA compression is embarrassingly parallel for all test cases.

6 Conclusion

This paper presents a parallel and purely algebraic ACA-type matrix decomposition algorithm given that any matrix entry can be evaluated in $O(1)$ time. Two proposed strategies, BACA and H-BACA, are leveraged to improve the robustness and parallel efficiency of the (baseline) ACA algorithm for general rank-deficient matrices.

First, the BACA algorithm searches for blocks of row/column pivots via column-pivoted QR on the column/row submatrices at each iteration. The blocking nature of BACA provides a closer estimation of the true residual error and reduces the chance of selecting smaller pivots when compared to ACA. Therefore, BACA exhibits a much smoother and more reliable convergence history. Moreover, blocked operations also benefit from higher flop performance compared to non-blocked ones. For a rank-deficient matrix with dimension $n$ and $\epsilon$ -rank $r$ , the computational cost of BACA is $O(nr^{2})$ assuming the block size constant and iteration count $O(r)$ .

Second, the H-BACA algorithm divides the matrix into $n_{b}$ similar-sized submatrices each compressed with BACA and then hierarchically merges the results using low-rank arithmetic. Depending on the rank behaviors of submatrices during the merge, H-BACA may have a computational overhead of $O(\sqrt{n_{b}})$ yielding the overall computational cost at most $O(nr^{2}\sqrt{n_{b}})$ . The H-BACA algorithm can be parallelized on distributed-memory machines by assigning each process to one submatrix and leveraging PBLAS and ScaLAPACK for the hierarchical merge operation. Such a parallelization strategy yields a much more favorable communication cost when compared to the straightforward parallelization of ACA/BACA with collective MPI routines. Not surprisingly, good parallel performance can be achieved for matrices with modest to large numerical ranks, which increase process utilization for each merge operation.

In contrast to the baseline ACA algorithm, the proposed algorithms exhibit improved robustness and favorable parallel performance with low computational overheads for broad ranges of matrices arising from many science and engineering applications.

Declaration of Conflicting Interests

The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Funding

The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This research was supported in part by the Exascale Computing Project (17-SC-20-SC), a collaborative effort of the U.S. Department of Energy Office of Science and the National Nuclear Security Administration, and in part by the U.S. Department of Energy, Office of Science, Office of Advanced Scientific Computing Research, Scientific Discovery through Advanced Computing (SciDAC) program through the FASTMath Institute under Contract No. DE-AC02-05CH11231 at Lawrence Berkeley National Laboratory.

Acknowledgements

This research used resources of the National Energy Research Scientific Computing Center (NERSC), a U.S. Department of Energy Office of Science User Facility operated under Contract No. DE-AC02-05CH11231.

Bibliography

Bach F (2013) Sharp analysis of low-rank kernel matrix approximations. In: Proceedings of the 26th Annual Conference on Learning Theory, Proceedings of Machine Learning Research, volume 30. Princeton, NJ, USA: PMLR, pp. 185–209.
Balzano L, Nowak R and Recht B (2010) Online identification and tracking of subspaces from highly incomplete information. In: 2010 48th Annual Allerton Conference on Communication, Control, and Computing (Allerton). pp. 704–711. DOI: 10.1109/ALLERTON.2010.5706976.
Bebendorf M (2000) Approximation of boundary element matrices. Numerische Mathematik 86(4): 565–589. DOI: 10.1007/PL00005410.
Bebendorf M and Grzhibovskis R (2006) Accelerating Galerkin BEM for linear elasticity using adaptive cross approximation. Mathematical Methods in the Applied Sciences 29(14): 1721–1747. DOI: 10.1002/mma.759.
Blackford LS, Choi J, Cleary A, D’Azevedo E, Demmel J, Dhillon I, Dongarra J, Hammarling S, Henry G, Petitet A, Stanley K, Walker D and Whaley RC (1997) ScaLAPACK users’ guide. Philadelphia, PA: Society for Industrial and Applied Mathematics. ISBN 0-89871-397-8 (paperback).
Boutsidis C, Mahoney MW and Drineas P (2009) An improved approximation algorithm for the column subset selection problem. In: Proceedings of the Twentieth Annual ACM-SIAM Symposium on Discrete Algorithms, SODA ’09. Philadelphia, PA, USA: Society for Industrial and Applied Mathematics, pp. 968–977.
Candès EJ and Recht B (2009) Exact matrix completion via convex optimization. Foundations of Computational Mathematics 9(6): 717. DOI: 10.1007/s10208-009-9045-5.
Cheng H, Gimbutas Z, Martinsson P and Rokhlin V (2005) On the compression of low rank matrices. SIAM Journal on Scientific Computing 26(4): 1389–1404. DOI: 10.1137/030602678.