Fast Randomized Matrix and Tensor Interpolative Decomposition Using CountSketch

Osman Asif Malik; Stephen Becker

arXiv:1901.10559·math.NA·December 20, 2024


TL;DR

This paper introduces a fast randomized algorithm for matrix and tensor interpolative decomposition using CountSketch, offering significant speed improvements while maintaining accuracy, applicable to large-scale data.

Contribution

The paper presents a novel CountSketch-based randomized algorithm for matrix and tensor interpolative decomposition with theoretical guarantees and improved computational efficiency.

Findings

1. Achieves at least an order of magnitude speed-up on large matrices and tensors.

2. Maintains accuracy comparable to existing methods.

3. Provides theoretical performance guarantees for both matrix and tensor cases.

Abstract

We propose a new fast randomized algorithm for interpolative decomposition of matrices which utilizes CountSketch. We then extend this approach to the tensor interpolative decomposition problem introduced by Biagioni et al. (J. Comput. Phys. 281, pp. 116-134, 2015). Theoretical performance guarantees are provided for both the matrix and tensor settings. Numerical experiments on both synthetic and real data demonstrate that our algorithms maintain the accuracy of competing methods, while running in less time, achieving at least an order of magnitude speed-up on large matrices and tensors.


Tables

Table 1: Comparison of the complexity for matrix ID algorithms.

Algorithm for matrix ID	Complexity
Standard (Alg. 1)	$KIR$
Gaussian	$K\,\textnormal{nnz}(\mathbf{A})+K^{2}R$
SRFT	$IR\log(K)+K^{2}R$
CountSketch (Proposal, Alg. 2)	$\textnormal{nnz}(\mathbf{A})+K^{2}R$

Table 2: Comparison of the complexity for tensor ID algorithms.

Algorithm for tensor ID	Complexity
Gram matrix	$NC_{\text{mult}}+R^{3}$
Gaussian	$NK\,\textnormal{nnz}(\mathbf{A})+K^{2}R$
CountSketch (Proposal, Alg. 3)	$N(\textnormal{nnz}(\mathbf{A})+RK\log K)+K^{2}R$

Table 3: Errors and run times in the real-world matrix ID experiment. The errors are computed using the randomized spectral norm by Woolfe et al. (2008).

Algorithm for matrix ID	Error	Run time (s)
Gaussian	1.505e−15	20.38
SRFT	1.507e−15	18.40
CountSketch (proposal)	1.504e−15	0.59

Table 4: Number of experiments out of 60 for which method A is more accurate than method B.

	Method B
Method A	Our Proposal	SRFT	Gaussian
Our Proposal	-	57	34
SRFT	3	-	2
Gaussian	26	58	-

Equations

\bm{\mathscr{X}}=\sum_{r=1}^{R}\lambda_{r}\,\mathbf{a}_{:r}^{(1)}\circ\mathbf{a}_{:r}^{(2)}\circ\cdots\circ\mathbf{a}_{:r}^{(N)}=\sum_{r=1}^{R}\lambda_{r}\bm{\mathscr{X}}^{(r)},

\hat{\bm{\mathscr{X}}}=\sum_{k=1}^{K}\hat{\lambda}_{k}\bm{\mathscr{X}}^{(j_{k})}\approx\bm{\mathscr{X}},

\mathbf{M}=\begin{bmatrix}\lambda_{1}\operatorname{vec}(\bm{\mathscr{X}}^{(1)})&\cdots&\lambda_{R}\operatorname{vec}(\bm{\mathscr{X}}^{(R)})\end{bmatrix}=\bigg{(}\operatorname*{\bigodot}_{n=1}^{N}\mathbf{A}^{(n)}\bigg{)}\textnormal{diag}(\lambda_{1},\ldots,\lambda_{R}),

\mathbf{M}^{\top}\mathbf{M}=(\mathbf{A}^{(1)\top}\mathbf{A}^{(1)})\circledast\cdots\circledast(\mathbf{A}^{(N)\top}\mathbf{A}^{(N)}),

\bm{\Omega}=\bigg{(}\operatorname*{\bigodot}_{n=1}^{N}\bm{\Omega}^{(n)}\bigg{)}^{\top}\in\mathbb{R}^{L\times\tilde{I}},

H(i_{1},i_{2},\ldots,i_{N})\stackrel{{\scriptstyle\text{\tiny{def}}}}{{=}}\Big{(}\sum_{n=1}^{N}(h_{n}(i_{n})-1)\mod L\Big{)}+1,

S(i_{1},i_{2},\ldots,i_{N})\stackrel{{\scriptstyle\text{\tiny{def}}}}{{=}}\prod_{n=1}^{N}s_{n}(i_{n}).

\mathbf{T}\mathbf{A}=\textnormal{FFT}^{-1}\Big{(}\operatorname*{\scalerel*{\circledast}{\sum}}_{n=1}^{N}\textnormal{FFT}(\mathbf{S}^{(n)}\mathbf{A}^{(n)})\Big{)},

2\beta(K^{2}+K)\leq L<I.

∥ A_{: j} P - A ∥ ≲ 2 σ_{K + 1} (A) K I R

∥ \hat{X} - X ∥_{F} ≲ 2 σ_{K + 1} (M) R K R \tilde{I}

\|(\mathbf{A}^{\top}\mathbf{A})^{-1}\mathbf{A}^{\top}\|=\frac{1}{\sigma_{R}(\mathbf{A})}.

\|\mathbf{B}\mathbf{P}-\mathbf{A}\|\leq\|\mathbf{X}\mathbf{S}\mathbf{A}-\mathbf{A}\|(\|\mathbf{P}\|+1)+\|\mathbf{X}\|\|\mathbf{S}\mathbf{B}\mathbf{P}-\mathbf{S}\mathbf{A}\|.

\|(\mathbf{I}^{(K)}-\mathbf{L})^{-1}\|\leq\frac{1}{1-\|\mathbf{L}\|}.

\mathbf{C}\stackrel{{\scriptstyle\text{\tiny{def}}}}{{=}}(\mathbf{S}\mathbf{U})^{\top}(\mathbf{S}\mathbf{U}),

e_{kk^{\prime}}\stackrel{{\scriptstyle\text{\tiny{def}}}}{{=}}\sum_{\begin{subarray}{c}i,i^{\prime}\in[I]\\ i\neq i^{\prime}\end{subarray}}d_{ii}d_{i^{\prime}i^{\prime}}u_{ik}u_{i^{\prime}k^{\prime}}\Big{(}\sum_{l\in[L]}\phi_{li}\phi_{li^{\prime}}\Big{)}.

c_{kk^{\prime}}=\sum_{l\in[L]}(\mathbf{S}\mathbf{U})_{lk}(\mathbf{S}\mathbf{U})_{lk^{\prime}}.

(\mathbf{S}\mathbf{U})_{lk}=\sum_{i\in[I]}\phi_{li}d_{ii}u_{ik},

c_{kk^{\prime}}=\sum_{i\in[I]}\sum_{l\in[L]}\phi_{li}^{2}d_{ii}^{2}u_{ik}u_{ik^{\prime}}+\sum_{\begin{subarray}{c}i,i^{\prime}\in[I]\\ i\neq i^{\prime}\end{subarray}}d_{ii}d_{i^{\prime}i^{\prime}}u_{ik}u_{i^{\prime}k^{\prime}}\Big{(}\sum_{l\in[L]}\phi_{li}\phi_{li^{\prime}}\Big{)}.

\phi_{li}^{2}=\begin{cases}1&\text{if }h(i)=l,\\ 0&\text{otherwise},\end{cases}

\sum_{i\in[I]}\sum_{l\in[L]}\phi_{li}^{2}d_{ii}^{2}u_{ik}u_{ik^{\prime}}=\sum_{i\in[I]}u_{ik}u_{ik^{\prime}}=\langle\mathbf{u}_{:k},\mathbf{u}_{:k^{\prime}}\rangle=\begin{cases}1&\text{if }k=k^{\prime},\\ 0&\text{otherwise}.\end{cases}

\Big{(}\frac{\alpha}{\alpha-1}\Big{)}^{2}\beta(K^{2}+K)\leq L<I.

\|\mathbf{E}\|\leq 1-\frac{1}{\alpha}

\mathbb{E}[e_{kk^{\prime}}^{2}]=\mathbb{E}\bigg{[}\sum_{\begin{subarray}{c}i,i^{\prime}\in[I]\\ i\neq i^{\prime}\end{subarray}}\sum_{\begin{subarray}{c}j,j^{\prime}\in[I]\\ j\neq j^{\prime}\end{subarray}}d_{ii}d_{i^{\prime}i^{\prime}}d_{jj}d_{j^{\prime}j^{\prime}}u_{ik}u_{i^{\prime}k^{\prime}}u_{jk}u_{j^{\prime}k^{\prime}}\Big{(}\sum_{l\in[L]}\phi_{li}\phi_{li^{\prime}}\Big{)}\Big{(}\sum_{l\in[L]}\phi_{lj}\phi_{lj^{\prime}}\Big{)}\bigg{]}.

\mathbb{E}\bigg{[}d_{ii}d_{i^{\prime}i^{\prime}}d_{jj}d_{j^{\prime}j^{\prime}}u_{ik}u_{i^{\prime}k^{\prime}}u_{jk}u_{j^{\prime}k^{\prime}}\Big{(}\sum_{l\in[L]}\phi_{li}\phi_{li^{\prime}}\Big{)}\Big{(}\sum_{l\in[L]}\phi_{lj}\phi_{lj^{\prime}}\Big{)}\bigg{]}=0,

\mathbb{E}[e_{kk^{\prime}}^{2}]=\sum_{\begin{subarray}{c}i,i^{\prime}\in[I]\\ i\neq i^{\prime}\end{subarray}}\mathbb{E}\bigg{[}d_{ii}^{2}d_{i^{\prime}i^{\prime}}^{2}u_{ik}^{2}u_{i^{\prime}k^{\prime}}^{2}\Big{(}\sum_{l\in[L]}\phi_{li}\phi_{li^{\prime}}\Big{)}^{2}\bigg{]}+\sum_{\begin{subarray}{c}i,i^{\prime}\in[I]\\ i\neq i^{\prime}\end{subarray}}\mathbb{E}\bigg{[}d_{ii}^{2}d_{i^{\prime}i^{\prime}}^{2}u_{ik}u_{i^{\prime}k^{\prime}}u_{i^{\prime}k}u_{ik^{\prime}}\Big{(}\sum_{l\in[L]}\phi_{li}\phi_{li^{\prime}}\Big{)}^{2}\bigg{]}.

\Big{(}\sum_{l\in[L]}\phi_{li}\phi_{li^{\prime}}\Big{)}^{2}=\begin{cases}1&\text{if }h(i)=h(i^{\prime}),\\ 0&\text{otherwise}.\end{cases}

\mathbb{E}\bigg{[}\Big{(}\sum_{l\in[L]}\phi_{li}\phi_{li^{\prime}}\Big{)}^{2}\bigg{]}=1\times\frac{1}{L}+0\times\Big{(}1-\frac{1}{L}\Big{)}=\frac{1}{L}.

\mathbb{E}[e_{kk^{\prime}}^{2}]=\frac{1}{L}\sum_{\begin{subarray}{c}i,i^{\prime}\in[I]\\ i\neq i^{\prime}\end{subarray}}u_{ik}^{2}u_{i^{\prime}k^{\prime}}^{2}+\frac{1}{L}\sum_{\begin{subarray}{c}i,i^{\prime}\in[I]\\ i\neq i^{\prime}\end{subarray}}u_{ik}u_{i^{\prime}k^{\prime}}u_{i^{\prime}k}u_{ik^{\prime}}.


Code & Models

Repositories

OsmanMalik/countsketch-matrix-tensor-id (Official)


Full text

O. A. Malik, Department of Applied Mathematics, University of Colorado Boulder. Email: [email protected]

S. Becker, Department of Applied Mathematics, University of Colorado Boulder. Email: [email protected]

This version of the article has been accepted for publication, after peer review (when applicable) and is subject to Springer Nature’s AM terms of use, but is not the Version of Record and does not reflect post-acceptance improvements, or any corrections. The Version of Record is available online at: http://dx.doi.org/10.1007/s10444-020-09816-9



Keywords: Matrix decomposition, Tensor decomposition, Sketching

MSC: 15-02

Journal: Advances in Computational Mathematics

1 Introduction

Matrix decomposition is a fundamental tool used to compress and analyze data, and to improve the speed of computations. For data and computational problems involving more than two dimensions, analogous tools in the form of tensors and associated decompositions have been developed (Kolda and Bader, 2009). In many modern applications, matrices and tensors can be very large, which makes decomposing them especially challenging. One approach to dealing with this problem is to incorporate randomization in decomposition algorithms (Halko et al., 2011). In this paper, we consider the interpolative decomposition (ID) for matrices, as well as the tensor ID problem. By tensor ID, we mean the tensor rank reduction problem as introduced by Biagioni et al. (2015); we provide an exact definition in Section 1.2.2. We make the following contributions in this paper:

• We propose a new fast randomized algorithm for matrix ID and provide theoretical performance guarantees.

• We propose a new randomized algorithm for tensor ID. To the best of our knowledge, we provide the first performance guarantees for any randomized tensor ID algorithm.

• We validate our algorithms on both synthetic and real data.

• We propose a small modification to the standard CountSketch formulation which helps avoid certain rank deficiency issues and slightly strengthens our matrix ID results.

1.1 Tensors and the CP Decomposition

For a more complete introduction to tensors and their decompositions, see the review paper by Kolda and Bader (2009). A tensor $\bm{\mathscr{X}}\in\mathbb{R}^{I_{1}\times I_{2}\times\cdots\times I_{N}}$ is an $N$ -dimensional array of real numbers, also called an $N$ -way tensor. The number of elements in such a tensor is denoted by $\tilde{I}\stackrel{{\scriptstyle\text{\tiny{def}}}}{{=}}\prod_{n=1}^{N}I_{n}$ . Boldface Euler script letters, e.g. $\bm{\mathscr{X}}$ , denote tensors of dimension 3 or greater; bold capital letters, e.g. $\mathbf{X}$ , denote matrices; bold lowercase letters, e.g. $\mathbf{x}$ , denote vectors; and lowercase letters, e.g. $x$ , denote scalars. Uppercase letters, e.g. $I$ , are used to denote scalars indicating dimension size. A colon is used to denote all elements along a certain dimension. For example, $\mathbf{x}_{m:}$ and $\mathbf{x}_{:n}$ are the $m$ th row and $n$ th column of the matrix $\mathbf{X}$ , respectively. If $\mathbf{j}$ is a vector of column indices, then $\mathbf{X}_{:\mathbf{j}}$ denotes the submatrix of $\mathbf{X}$ consisting of the columns of $\mathbf{X}$ whose indices are listed in $\mathbf{j}$ . $\mathbf{I}^{(K)}$ denotes the $K\times K$ identity matrix. For a matrix $\mathbf{X}$ , $\sigma_{i}(\mathbf{X})$ denotes its $i$ th singular value, and $\sigma_{\text{max}}(\mathbf{X})$ and $\sigma_{\text{min}}(\mathbf{X})$ denote the maximum and minimum singular values, respectively. The condition number of $\mathbf{X}$ is defined as $\kappa(\mathbf{X})\stackrel{{\scriptstyle\text{\tiny{def}}}}{{=}}\sigma_{\text{max}}(\mathbf{X})/\sigma_{\text{min}}(\mathbf{X})$ . The number of nonzero elements of $\mathbf{X}$ is denoted by $\textnormal{nnz}(\mathbf{X})$ . For positive integers $m$ and $n>m$ , let $[m]\stackrel{{\scriptstyle\text{\tiny{def}}}}{{=}}\{1,2,\ldots,m\}$ and $[m:n]\stackrel{{\scriptstyle\text{\tiny{def}}}}{{=}}\{m,m+1,\ldots,n\}$ . 
The Hadamard product, or element-wise product, of matrices is denoted by $\circledast$ . The Khatri–Rao product of matrices is denoted by $\odot$ . The tensor Frobenius norm is denoted by $\|\bm{\mathscr{X}}\|_{\textup{F}}\stackrel{{\scriptstyle\text{\tiny{def}}}}{{=}}\|\operatorname{vec}(\bm{\mathscr{X}})\|_{2}$ , where $\operatorname{vec}(\bm{\mathscr{X}})$ flattens the tensor $\bm{\mathscr{X}}$ into a column vector. A norm $\|\cdot\|$ with no subscript will always denote the matrix spectral norm.

The singular value decomposition (SVD) decomposes matrices into a sum of rank-1 matrices (Golub and Van Loan, 2013). Similarly, the CP decomposition decomposes a tensor $\bm{\mathscr{X}}\in\mathbb{R}^{I_{1}\times I_{2}\times\cdots\times I_{N}}$ into a sum of rank-1 tensors:

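$$\bm{\mathscr{X}}=\sum_{r=1}^{R}\lambda_{r}\,\mathbf{a}_{:r}^{(1)}\circ\mathbf{a}_{:r}^{(2)}\circ\cdots\circ\mathbf{a}_{:r}^{(N)}=\sum_{r=1}^{R}\lambda_{r}\bm{\mathscr{X}}^{(r)},\tag{1}$$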

where $\circ$ denotes outer product, and each $\bm{\mathscr{X}}^{(r)}$ is a rank-1 tensor. Each $\lambda_{r}$ is called an s-value, each $\mathbf{A}^{(n)}=[\mathbf{a}_{:1}^{(n)}\;\;\mathbf{a}_{:2}^{(n)}\;\;\cdots\;\;\mathbf{a}_{:R}^{(n)}]$ is called a factor matrix, and all vectors $\mathbf{a}^{(n)}_{:r}$ have unit 2-norm. Usually, a tensor $\bm{\mathscr{X}}$ is said to be of rank- $R$ if $R$ is the smallest possible number of terms required in a representation of the form (1). We will use the term “rank” in a looser sense to mean the (not necessarily minimal) number of rank-1 terms in a representation of the form (1).
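As a small illustration of the CP format described above, the following sketch builds a dense 3-way tensor from s-values and factor matrices with unit-norm columns. The helper name `cp_to_tensor`, the sizes, and the random data are hypothetical choices made for this example; they do not come from the paper.

```python
import numpy as np

def cp_to_tensor(lam, factors):
    """Form the dense tensor sum_r lam[r] * a_r^(1) o a_r^(2) o ... o a_r^(N)."""
    shape = tuple(A.shape[0] for A in factors)
    X = np.zeros(shape)
    for r in range(lam.shape[0]):
        term = factors[0][:, r]
        for A in factors[1:]:
            term = np.multiply.outer(term, A[:, r])  # outer product with next mode
        X += lam[r] * term                           # add the r-th rank-1 tensor
    return X

rng = np.random.default_rng(0)
dims, R = (4, 5, 6), 3                               # illustrative sizes only
factors = [rng.standard_normal((I_n, R)) for I_n in dims]
factors = [A / np.linalg.norm(A, axis=0) for A in factors]  # unit 2-norm columns
lam = rng.uniform(1.0, 2.0, size=R)                  # s-values
X = cp_to_tensor(lam, factors)
```

Only the $R(1+\sum_{n}I_{n})$ numbers in `lam` and `factors` need to be stored in CP format, whereas the dense tensor `X` has $\tilde{I}$ entries.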

1.2 Interpolative Decomposition

1.2.1 Matrix Interpolative Decomposition

For a matrix $\mathbf{A}\in\mathbb{R}^{I\times R}$ , a rank- $K$ interpolative decomposition (ID) takes the form $\mathbf{A}\approx\mathbf{A}_{:\mathbf{j}}\mathbf{P}$ , where $\mathbf{A}_{:\mathbf{j}}\in\mathbb{R}^{I\times K}$ consists of a subset of $K<R$ columns from $\mathbf{A}$ , and $\mathbf{P}\in\mathbb{R}^{K\times R}$ is a coefficient matrix which is well-conditioned in some sense. The fact that the decomposition is expressed in terms of the columns of $\mathbf{A}$ means that $\mathbf{A}_{:\mathbf{j}}$ inherits properties such as sparsity and non-negativity from $\mathbf{A}$ . Moreover, expressing the decomposition in terms of columns of $\mathbf{A}$ can increase interpretability. Algorithm 1 outlines one method to compute a matrix ID.

Fact 1

If the partial QR factorization on line 3 in Algorithm 1 is done using the strongly rank-revealing QR (SRRQR) decomposition developed by Gu and Eisenstat (1996), then Algorithm 1 has complexity $O(IR^{2})$ (Cheng et al., 2005). Moreover, the decomposition it produces satisfies the following properties (Martinsson et al., 2011):

(i) Some subset of the columns of $\mathbf{P}$ makes up the $K\times K$ identity matrix,

(ii) no entry of $\mathbf{P}$ has an absolute value exceeding 2,

(iii) $\|\mathbf{P}\|\leq\sqrt{4K(R-K)+1}$ ,

(iv) $\sigma_{\text{min}}(\mathbf{P})\geq 1$ ,

(v) $\mathbf{A}_{:\mathbf{j}}\mathbf{P}=\mathbf{A}$ when $K=I$ or $K=R$ , and

(vi) $\|\mathbf{A}_{:\mathbf{j}}\mathbf{P}-\mathbf{A}\|\leq\sigma_{K+1}(\mathbf{A})\sqrt{4K(R-K)+1}$ when $K<\min(I,R)$ .

In practice, using a variant of column pivoted QR instead of the SRRQR on line 3 of Algorithm 1 works just as well, and reduces the complexity of the algorithm to $O(KIR)$ (Cheng et al., 2005).
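The column pivoted QR route mentioned above can be sketched as follows. This is a minimal illustration using SciPy's `qr` with `pivoting=True`, not the paper's Algorithm 1 verbatim and not SRRQR; the function name `matrix_id` and the test matrix are assumptions made for the example.

```python
import numpy as np
from scipy.linalg import qr, solve_triangular

def matrix_id(A, K):
    """Rank-K ID A ~= A[:, j] @ P via column-pivoted QR (a stand-in for SRRQR)."""
    _, Rfac, piv = qr(A, mode="economic", pivoting=True)  # A[:, piv] = Q @ Rfac
    R11 = Rfac[:K, :K]                     # leading K x K upper-triangular block
    R12 = Rfac[:K, K:]                     # coupling to the remaining columns
    T = solve_triangular(R11, R12)         # T = R11^{-1} R12
    P = np.zeros((K, A.shape[1]))
    P[:, piv[:K]] = np.eye(K)              # selected columns reproduce themselves
    P[:, piv[K:]] = T                      # other columns are interpolated
    return piv[:K], P

rng = np.random.default_rng(0)
A = rng.standard_normal((50, 5)) @ rng.standard_normal((5, 12))  # exactly rank 5
j, P = matrix_id(A, 5)
err = np.linalg.norm(A[:, j] @ P - A, 2)
```

Because `A` has exact rank 5 here, `err` is at the level of rounding error, and `P[:, j]` is the identity, in line with property (i). The entry-wise bound in property (ii) is a guarantee of SRRQR and is not enforced by plain column pivoting.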

There have been subsequent proposals for randomized versions of matrix ID (Liberty et al., 2007). Martinsson et al. (2011) propose a variant which incorporates Gaussian random sketching. It computes a sketch $\mathbf{Y}=\bm{\Omega}\mathbf{A}$ , where $\bm{\Omega}\in\mathbb{R}^{L\times I}$ ( $K<L<I$ ) is a matrix with iid standard normal entries, and then computes an ID $\mathbf{Y}\approx\mathbf{Y}_{:\mathbf{j}}\mathbf{P}$ . The same $\mathbf{j}$ and $\mathbf{P}$ then give an ID of $\mathbf{A}\approx\mathbf{A}_{:\mathbf{j}}\mathbf{P}$ . Woolfe et al. (2008) propose a similar fast randomized algorithm which uses a subsampled randomized fast Fourier transform (SRFT) instead of a Gaussian matrix. It computes a sketch $\mathbf{Y}=\mathbf{S}_{\text{sub}}\mathbf{F}\mathbf{D}\mathbf{A}$ , where $\mathbf{D}\in\mathbb{R}^{I\times I}$ is a diagonal matrix with each diagonal entry iid and equal to $+1$ or $-1$ with equal probability, $\mathbf{F}\in\mathbb{R}^{I\times I}$ is the fast Fourier transform (FFT), and $\mathbf{S}_{\text{sub}}\in\mathbb{R}^{L\times I}$ is a subsampling operator that randomly samples $L$ rows.
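The two sketches described above can be formed as in the following illustration. It only constructs $\mathbf{Y}$; the subsequent ID of $\mathbf{Y}$ is omitted. Sampling rows without replacement, leaving the FFT unnormalized, and the sizes used are assumptions made for this example.

```python
import numpy as np

rng = np.random.default_rng(2)
I, R, K = 64, 20, 5
L = K + 10                                   # sketch size with K < L < I
A = rng.standard_normal((I, K)) @ rng.standard_normal((K, R))  # exactly rank K

# Gaussian sketch: Y = Omega A with iid standard normal Omega of size L x I
Omega = rng.standard_normal((L, I))
Y_gauss = Omega @ A

# SRFT sketch: Y = S_sub F D A
d = rng.choice([-1.0, 1.0], size=I)          # diagonal of D, random signs
FDA = np.fft.fft(d[:, None] * A, axis=0)     # apply F to every column of D A
rows = rng.choice(I, size=L, replace=False)  # S_sub keeps L randomly chosen rows
Y_srft = FDA[rows, :]                        # complex-valued L x R sketch

sv_gauss = np.linalg.svd(Y_gauss, compute_uv=False)
sv_srft = np.linalg.svd(Y_srft, compute_uv=False)
```

Both sketches have only $L$ rows but retain the numerical rank $K$ of `A`, which is why an ID computed from $\mathbf{Y}$ can be reused for $\mathbf{A}$.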

1.2.2 Tensor Interpolative Decomposition

Biagioni et al. (2015) consider the problem of rank reduction of a CP tensor, which they call tensor ID. Suppose $\bm{\mathscr{X}}\in\mathbb{R}^{I_{1}\times I_{2}\times\cdots\times I_{N}}$ is an $N$ -way tensor with CP decomposition (1). Computing a rank- $K$ , $K<R$ , tensor ID of $\bm{\mathscr{X}}$ amounts to finding a representation

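$$\hat{\bm{\mathscr{X}}}=\sum_{k=1}^{K}\hat{\lambda}_{k}\bm{\mathscr{X}}^{(j_{k})}\approx\bm{\mathscr{X}},\tag{2}$$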

where $\mathbf{j}\in[R]^{K}$ contains $K$ unique indices. Tensor ID has many applications. For example, in various algorithms, the rank of discretized separated representations of multivariate functions grows with each iteration, requiring repeated rank reduction of CP tensors (Beylkin and Mohlenkamp, 2002, 2006). Another example is the algorithm by Reynolds et al. (2017) for finding the element of maximum magnitude in a CP tensor, which also requires repeated rank reduction.

Biagioni et al. (2015) approach the tensor ID problem by considering the matrix

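$$\mathbf{M}=\begin{bmatrix}\lambda_{1}\operatorname{vec}(\bm{\mathscr{X}}^{(1)})&\cdots&\lambda_{R}\operatorname{vec}(\bm{\mathscr{X}}^{(R)})\end{bmatrix}=\bigg{(}\operatorname*{\bigodot}_{n=1}^{N}\mathbf{A}^{(n)}\bigg{)}\textnormal{diag}(\lambda_{1},\ldots,\lambda_{R}),\tag{3}$$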

where $\textnormal{diag}(\lambda_{1},\ldots,\lambda_{R})\in\mathbb{R}^{R\times R}$ is a diagonal matrix with entries $\lambda_{1},\ldots,\lambda_{R}$ . The tensor ID problem can now be reduced to identifying columns of $\mathbf{M}$ using matrix ID. However, when the factor matrices have no special structure, $\mathbf{M}$ has $R\tilde{I}$ elements and is therefore typically infeasible to form. One way to tackle this problem is to form the much smaller Gram matrix $\mathbf{M}^{\top}\mathbf{M}\in\mathbb{R}^{R\times R}$ , which can be done using $O(R^{2}\sum_{n}I_{n})$ flops since

$$\mathbf{M}^{\top}\mathbf{M}=(\mathbf{A}^{(1)\top}\mathbf{A}^{(1)})\circledast\cdots\circledast(\mathbf{A}^{(N)\top}\mathbf{A}^{(N)}),\tag{4}$$

then compute its symmetric matrix ID, and use it to compute an ID of $\mathbf{M}$ . This approach, however, can lead to accuracy issues because the Gram matrix can be ill-conditioned: $\kappa(\mathbf{M}^{\top}\mathbf{M})=\kappa^{2}(\mathbf{M})$ (Biagioni et al., 2015). Biagioni et al. (2015) therefore propose a randomized method which avoids the ill-conditioning issue and reduces the complexity. This is done by applying a kind of Gaussian sketch to $\mathbf{M}$ , but instead of forming a full Gaussian matrix of size $L\times\tilde{I}$ , a matrix of the form

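$$\bm{\Omega}=\bigg{(}\operatorname*{\bigodot}_{n=1}^{N}\bm{\Omega}^{(n)}\bigg{)}^{\top}\in\mathbb{R}^{L\times\tilde{I}},\tag{5}$$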

is used, where each $\bm{\Omega}^{(n)}\in\mathbb{R}^{I_{n}\times L}$ is a matrix with elements that are iid standard normal random variables. The sketch $\mathbf{Y}=\bm{\Omega}\mathbf{M}$ can then be computed efficiently without ever forming $\bm{\Omega}$ or $\mathbf{M}$ , since $y_{lr}=\lambda_{r}\prod_{n=1}^{N}\langle\bm{\omega}^{(n)}_{:l},\mathbf{a}^{(n)}_{:r}\rangle$ . Note that the elements of $\bm{\Omega}$ in (5) are not independent. This means that the theory for Gaussian matrix ID, which requires independence, cannot be used to provide guarantees for sketched matrix ID using $\bm{\Omega}$ .
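The entry-wise formula for $y_{lr}$ above can be checked numerically on a tiny example. The sketch below computes $\mathbf{Y}$ implicitly from the factor matrices and compares it with the explicit product $\bm{\Omega}\mathbf{M}$, which is feasible to form only because the sizes are small. The helper `khatri_rao` and all sizes are assumptions made for this illustration.

```python
import numpy as np

def khatri_rao(mats):
    """Column-wise Kronecker (Khatri-Rao) product of matrices with equal column count."""
    out = mats[0]
    for B in mats[1:]:
        out = np.einsum("ir,jr->ijr", out, B).reshape(-1, out.shape[1])
    return out

rng = np.random.default_rng(1)
dims, R, L = (4, 5, 6), 7, 10
A = [rng.standard_normal((I_n, R)) for I_n in dims]    # factor matrices A^(n)
lam = rng.uniform(1.0, 2.0, size=R)                    # s-values
Om = [rng.standard_normal((I_n, L)) for I_n in dims]   # Gaussian Omega^(n)

# Implicit sketch: y_lr = lam_r * prod_n <omega^(n)_{:l}, a^(n)_{:r}>
Y = np.ones((L, R))
for Om_n, A_n in zip(Om, A):
    Y *= Om_n.T @ A_n          # L x R matrix of inner products for mode n
Y *= lam                       # scale column r by lam_r

# Explicit reference with the full matrices
M = khatri_rao(A) * lam        # (prod_n I_n) x R
Omega = khatri_rao(Om).T       # L x (prod_n I_n)
Y_ref = Omega @ M
```

The implicit computation costs $O(LR\sum_{n}I_{n})$ flops, while the explicit one scales with $\tilde{I}$.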

1.3 Basics of CountSketch

Our proposed method uses a type of sketching called CountSketch (Charikar et al., 2004; Clarkson and Woodruff, 2017), which we now describe. Let $h:[I]\rightarrow[L]$ be a random map such that each $h(i)$ is iid and $(\forall i\in[I])(\forall l\in[L])$ $\mathbb{P}(h(i)=l)=1/L$ , let $\bm{\Phi}\in\mathbb{R}^{L\times I}$ be a matrix with $\phi_{h(i)i}=1$ and all other entries equal to 0, and let $\mathbf{D}\in\mathbb{R}^{I\times I}$ be a diagonal matrix with each diagonal entry iid and equal to $+1$ or $-1$ with equal probability. The CountSketch operator $\mathbf{S}\in\mathbb{R}^{L\times I}$ is then defined as $\mathbf{S}=\bm{\Phi}\mathbf{D}$ . Applying $\mathbf{S}$ to $\mathbf{A}\in\mathbb{R}^{I\times R}$ does the following: The matrix $\mathbf{D}$ changes the sign of each row of $\mathbf{A}$ with probability $1/2$ , and the matrix $\bm{\Phi}$ then randomly adds each row of $\mathbf{D}\mathbf{A}$ to one of $L$ target rows. Due to the special structure of $\mathbf{S}$ , it can be applied implicitly with complexity $O(\textnormal{nnz}(\mathbf{A}))$ (Clarkson and Woodruff, 2017).
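The implicit application of $\mathbf{S}$ described above can be sketched as follows, together with a check against the explicit matrix $\bm{\Phi}\mathbf{D}$. Indices are 0-based here, and the function name `countsketch` and the sizes are assumptions made for the example.

```python
import numpy as np

def countsketch(A, L, rng):
    """Return Y = S A for a CountSketch S = Phi D, without forming S."""
    I = A.shape[0]
    h = rng.integers(0, L, size=I)         # h(i): target row of row i, uniform on [L]
    s = rng.choice([-1.0, 1.0], size=I)    # diagonal of D: random signs
    Y = np.zeros((L, A.shape[1]))
    np.add.at(Y, h, s[:, None] * A)        # add signed row i of A to row h(i) of Y
    return Y, h, s

rng = np.random.default_rng(0)
I, R, L = 100, 6, 12
A = rng.standard_normal((I, R))
Y, h, s = countsketch(A, L, rng)

# Explicit reference: Phi has a single 1 per column, in row h(i)
Phi = np.zeros((L, I))
Phi[h, np.arange(I)] = 1.0
S = Phi * s                                # same as Phi @ diag(s)
Y_ref = S @ A
```

Each row of `A` is touched once, which for a dense input corresponds to the $O(\textnormal{nnz}(\mathbf{A}))$ cost; a sparse implementation would iterate over nonzeros only.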

Suppose $\mathbf{A}$ has the special structure $\mathbf{A}=\operatorname*{\bigodot}_{n=1}^{N}\mathbf{A}^{(n)}\in\mathbb{R}^{\tilde{I}\times R}$ , where each $\mathbf{A}^{(n)}\in\mathbb{R}^{I_{n}\times R}$ . For such matrices, there is a variant of CountSketch which allows computing the sketch of $\mathbf{A}$ without ever having to form the full matrix, which can be prohibitively large to store explicitly. This variant is called TensorSketch and is developed by Pagh (2013), Pham and Pagh (2013), Avron et al. (2014) and Diao et al. (2018). It works as follows:

• Define $N$ independent random maps $h_{n}:[I_{n}]\rightarrow[L]$ , $n\in[N]$ , such that each $h_{n}(i)$ is iid and $(\forall i\in[I_{n}])(\forall l\in[L])$ $\mathbb{P}(h_{n}(i)=l)=1/L$ ; and

• define $N$ independent random sign functions $s_{n}:[I_{n}]\rightarrow\{+1,-1\}$ , $n\in[N]$ , such that $(\forall i\in[I_{n}])$ $\mathbb{P}(s_{n}(i)=+1)=\mathbb{P}(s_{n}(i)=-1)=1/2$ .

Next, define $H:[I_{1}]\times[I_{2}]\times\cdots\times[I_{N}]\rightarrow[L]$ as

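$$H(i_{1},i_{2},\ldots,i_{N})\stackrel{{\scriptstyle\text{\tiny{def}}}}{{=}}\Big{(}\sum_{n=1}^{N}(h_{n}(i_{n})-1)\mod L\Big{)}+1,\tag{6}$$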

and $S:[I_{1}]\times[I_{2}]\times\cdots\times[I_{N}]\rightarrow\{+1,-1\}$ as

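$$S(i_{1},i_{2},\ldots,i_{N})\stackrel{{\scriptstyle\text{\tiny{def}}}}{{=}}\prod_{n=1}^{N}s_{n}(i_{n}).\tag{7}$$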

Notice that each row index of $\mathbf{A}$ corresponds to a unique $N$ -tuple $(i_{1},\ldots,i_{N})$ . $H$ and $S$ can therefore be considered functions on $[\tilde{I}]$ . With this in mind, let $\mathbf{D}_{S}\in\mathbb{R}^{\tilde{I}\times\tilde{I}}$ denote a diagonal matrix with the $i$ th diagonal entry equal to $S(i)$ . If $H$ and $\mathbf{D}_{S}$ are used instead of $h$ and $\mathbf{D}$ in the definition of CountSketch above, we get TensorSketch, which we will denote by $\mathbf{T}\in\mathbb{R}^{L\times\tilde{I}}$ . The reason for choosing this formulation is that it can be computed efficiently using the following formula:

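$$\mathbf{T}\mathbf{A}=\textnormal{FFT}^{-1}\Big{(}\operatorname*{\scalerel*{\circledast}{\sum}}_{n=1}^{N}\textnormal{FFT}(\mathbf{S}^{(n)}\mathbf{A}^{(n)})\Big{)},\tag{8}$$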

where each $\mathbf{S}^{(n)}\in\mathbb{R}^{L\times I_{n}}$ is a CountSketch operator defined using $h_{n}$ and the diagonal matrix $\textnormal{diag}(s_{n}(1),\ldots,s_{n}(I_{n}))$ . The formula (8) follows from the discussion in Section A in the supplementary material of Diao et al. (2018). Other good sources for further details on TensorSketch are Pagh (2013), Pham and Pagh (2013) and Avron et al. (2014).
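The FFT-based computation of TensorSketch can be illustrated on a tiny example and compared with a brute-force evaluation that loops over every row of the Khatri–Rao product. Hash values are 0-based in this sketch, so the combined hash is simply the sum of the individual hashes modulo $L$; the sizes and variable names are assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(3)
dims, R, L = (4, 5, 6), 3, 8
A = [rng.standard_normal((I_n, R)) for I_n in dims]      # factor matrices A^(n)
h = [rng.integers(0, L, size=I_n) for I_n in dims]       # hash maps h_n (0-based)
s = [rng.choice([-1.0, 1.0], size=I_n) for I_n in dims]  # sign functions s_n

# FFT route: Hadamard product of column-wise FFTs of the CountSketches S^(n) A^(n)
prod = np.ones((L, R), dtype=complex)
for A_n, h_n, s_n in zip(A, h, s):
    Y_n = np.zeros((L, R))
    np.add.at(Y_n, h_n, s_n[:, None] * A_n)              # Y_n = S^(n) A^(n)
    prod *= np.fft.fft(Y_n, axis=0)
TA = np.fft.ifft(prod, axis=0).real                      # inverse FFT, column-wise

# Brute-force reference over all index tuples (i_1, ..., i_N)
TA_ref = np.zeros((L, R))
for idx in np.ndindex(*dims):
    H = sum(h_n[i] for h_n, i in zip(h, idx)) % L        # combined hash
    S = np.prod([s_n[i] for s_n, i in zip(s, idx)])      # combined sign
    row = np.prod([A_n[i, :] for A_n, i in zip(A, idx)], axis=0)  # Khatri-Rao row
    TA_ref[H] += S * row
```

The FFT route never touches the $\tilde{I}$ rows of the Khatri–Rao product; it works because multiplying the FFTs corresponds to circular convolution, which adds the hash values modulo $L$.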

2 Other Related Work

We provided an overview of existing ID algorithms in Section 1.2. The matrix ID is related to the CX and CUR decompositions (Drineas and Kannan, 2003; Drineas et al., 2006, 2008; Mahoney and Drineas, 2009; Bien et al., 2010; Wang and Zhang, 2013; Boutsidis and Woodruff, 2017), also known as skeleton approximations (Goreinov et al., 1997a, b; Tyrtyshnikov, 2000), and the column subset selection problem (Frieze et al., 2004; Deshpande and Vempala, 2006; Deshpande et al., 2006; Boutsidis et al., 2009; Deshpande and Rademacher, 2010; Guruswami and Sinop, 2012; Boutsidis et al., 2014). Like ID, the CX decomposition takes the form $\mathbf{A}\approx\mathbf{C}\mathbf{X}$ , where $\mathbf{C}$ contains a subset of the columns of $\mathbf{A}$ . The crucial feature that distinguishes ID from a CX decomposition is the additional conditioning requirements on the coefficient matrix $\mathbf{P}$ in ID; the matrix $\mathbf{X}$ in a CX decomposition is not required to have the properties (i)–(iv) listed in Fact 1 (Drineas et al., 2008). A CUR decomposition takes the form $\mathbf{A}\approx\mathbf{C}\mathbf{U}\mathbf{R}$ , where $\mathbf{C}$ and $\mathbf{R}$ contain a subset of the columns and rows of $\mathbf{A}$ , respectively. Consequently, setting $\mathbf{X}=\mathbf{U}\mathbf{R}$ would yield a CX decomposition. It is well-known that the matrix $\mathbf{X}$ defined in this manner is typically ill-conditioned (Voronin and Martinsson, 2017). Since we require the coefficient matrix $\mathbf{P}$ in our decomposition to be well-conditioned, the available algorithms for CX and CUR decomposition are not useful to us.

Various randomized algorithms have been utilized in the context of tensor decomposition before. Examples include the works of Wang et al. (2015), Battaglino et al. (2018), and Yang et al. (2018) for the CP decomposition; Drineas and Mahoney (2007), Tsourakakis (2010), da Costa et al. (2016) and Malik and Becker (2018) for the Tucker decomposition; and Zhang et al. (2018) and Tarzanagh and Michailidis (2018) for t-product based decompositions. Other notable works that use CUR-type algorithms or sampling are e.g. those by Mahoney et al. (2008), Caiafa and Cichocki (2010), Oseledets et al. (2008) and Friedland et al. (2011). The tensor ID which we consider is different from the various problems solved in these previous papers. The goal of tensor ID is not to compute a tensor decomposition from an arbitrary data tensor. Instead, the purpose of tensor ID is to compress a tensor which is already in CP format in an efficient and principled manner. To the best of our knowledge, the only work aside from that by Biagioni et al. (2015) which considers randomized tensor ID is the paper by Reynolds et al. (2016). They introduce a randomized alternating least-squares (ALS) algorithm, which is better conditioned but slower than the standard ALS algorithm for CP decomposition. Biagioni et al. (2015) conclude that standard ALS is much slower than their Gaussian sketching algorithm. We therefore do not compare our proposed tensor ID to the randomized ALS by Reynolds et al. (2016) since it is even slower.

To the best of our knowledge, TensorSketch was the first sketch with theoretical guarantees that could be applied particularly efficiently to matrices like $\mathbf{M}$ in (3) with Kronecker structured columns. Recently, a number of works have appeared that provide guarantees for other methods designed for efficient sketching of Kronecker structured vectors. Sun et al. (2018) consider sketches of the form (5) where each $\bm{\Omega}^{(n)}$ has sub-Gaussian entries. They provide Johnson–Lindenstrauss (JL) style guarantees for the case when the sketch is a Khatri–Rao product of two smaller matrices, i.e., for $N=2$ in (5). Rakhshan and Rabusseau (2020) consider sketches which have tensor train or CP tensor structure, and with core tensors and factor matrices that have Gaussian entries. They provide JL style guarantees for these sketches for arbitrary orders of the random tensors. Their sketch with CP tensor structure includes the sketch in (5) as a special case. Another line of work considers the Kronecker fast JL transform, which is a structured variant of the fast JL transform of Ailon and Chazelle (2009). It was first proposed by Battaglino et al. (2018) with theoretical guarantees later provided by Jin et al. (2019) and Malik and Becker (2020). A variant of this transform is also considered by Iwen et al. (2019).

3 Fast Randomized Matrix ID Using CountSketch

Algorithm 2 explains our proposal for CountSketch matrix ID. Proposition 1 provides guarantees for the method. A proof is provided in Section 7.1, which also contains a more detailed version of the bound in (10).

Proposition 1 (CountSketch matrix ID)

Suppose $I$ , $R$ and $K<R$ are defined as in Algorithm 2. Let $\beta>1$ be a real number and $L$ a positive integer such that

$$2\beta(K^{2}+K)\leq L<I.\tag{9}$$

Suppose that the matrix ID on line 5 of Algorithm 2 utilizes SRRQR. Then, the output $\mathbf{P}$ of Algorithm 2 satisfies properties (i)–(iv) in Fact 1. Moreover, the outputs $[\mathbf{P},\mathbf{j}]$ satisfy

[TABLE]

with probability at least $1-\frac{1}{\beta}$ .

The condition in (9) is very similar to that for SRFT matrix ID by Woolfe et al. (2008). The only difference is that instead of a term of the form $(K^{2}+K)$ , their work only has a factor $K^{2}$ . In practice, the condition in (9) is very conservative. We find that a small oversampling factor, e.g. $L=K+10$ , works well in practice, producing errors of the same size as the other randomized ID methods.
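A minimal end-to-end sketch of CountSketch matrix ID with the oversampling choice $L=K+10$ is given below. It is an illustration rather than the authors' Algorithm 2: the ID of the sketch uses column-pivoted QR instead of SRRQR, and the function name `countsketch_id` and the synthetic test matrix are assumptions.

```python
import numpy as np
from scipy.linalg import qr, solve_triangular

def countsketch_id(A, K, L, rng):
    """Rank-K ID of A computed from an L x R CountSketch of A."""
    I, R = A.shape
    h = rng.integers(0, L, size=I)            # hash map h: [I] -> [L], 0-based
    s = rng.choice([-1.0, 1.0], size=I)       # diagonal of D
    Y = np.zeros((L, R))
    np.add.at(Y, h, s[:, None] * A)           # Y = S A without forming S
    # ID of the small sketch via column-pivoted QR (stand-in for SRRQR)
    _, Rfac, piv = qr(Y, mode="economic", pivoting=True)
    T = solve_triangular(Rfac[:K, :K], Rfac[:K, K:])
    P = np.zeros((K, R))
    P[:, piv[:K]] = np.eye(K)
    P[:, piv[K:]] = T
    return piv[:K], P

rng = np.random.default_rng(4)
I, R, K = 2000, 40, 8
A = rng.standard_normal((I, K)) @ rng.standard_normal((K, R))  # exactly rank K
j, P = countsketch_id(A, K, L=K + 10, rng=rng)
rel_err = np.linalg.norm(A[:, j] @ P - A, 2) / np.linalg.norm(A, 2)
```

The QR factorization acts on an $L\times R$ matrix instead of an $I\times R$ one, which is where the savings over the unsketched approach come from. The same `j` and `P` are then applied to the full matrix `A`.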

Remark 1

The semi-coherent matrices defined by Avron et al. (2010) are adversarial to CountSketch, and therefore to our proposed method. For such matrices, using $L=K+10$ may result in a large error for our method. Some care is therefore necessary when applying our method together with this choice of $L$ . In Section 6.1.2, we do extensive testing of our method on real-world matrices to demonstrate that using $L=K+10$ works well in practice. We also provide an example of a matrix with semi-coherent structure on which our method fails when choosing $L$ like this.

Remark 2

In cases when the target rank $K$ is quite large (e.g. $K=R/2$ ), an issue we encountered is that $\mathbf{S}\mathbf{A}$ can be rank deficient due to rank deficiency of $\mathbf{S}$ . This issue can be dealt with easily by defining $\mathbf{S}$ slightly differently to ensure that each row contains at least one nonzero element. This is done by the following straightforward modification of the map $h$ in the definition of CountSketch in Section 1.3: Let each $h(i)=v_{i}$ , where $\mathbf{v}\in\mathbb{R}^{I}$ is a uniform random permutation of the elements of the vector $[1,\cdots,L,x_{L+1},\cdots,x_{I}]$ , where each $x_{i}\in[L]$ is iid uniformly random. With this modification, the guarantees of Proposition 1 still hold. In fact, the condition in (9) is slightly improved. We give a precise statement with proof in Section 7.2.
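The modified hash map in Remark 2 can be sketched as follows, using 0-based target rows. The function name `surjective_hash` is a hypothetical label for this illustration.

```python
import numpy as np

def surjective_hash(I, L, rng):
    """Random map h: [I] -> [L] that hits every target row at least once."""
    if L > I:
        raise ValueError("need L <= I")
    fixed = np.arange(L)                        # the entries 1, ..., L (0-based here)
    extra = rng.integers(0, L, size=I - L)      # x_{L+1}, ..., x_I iid uniform on [L]
    v = np.concatenate([fixed, extra])
    return rng.permutation(v)                   # uniform random permutation of v

rng = np.random.default_rng(0)
I, L = 50, 20
h = surjective_hash(I, L, rng)

# Every row of Phi now contains at least one nonzero element
Phi = np.zeros((L, I))
Phi[h, np.arange(I)] = 1.0
row_counts = Phi.sum(axis=1)
```

With the standard iid hash, some of the $L$ target rows can stay empty when $L$ is not small relative to $I$, which makes $\mathbf{S}$ rank deficient; the construction above rules this out by design.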

4 Extending the Results to Tensor ID

Let $\bm{\mathscr{X}}$ and $\hat{\bm{\mathscr{X}}}$ be defined as in (1) and (2), respectively. Our approach to the tensor ID problem is similar to that of Biagioni et al. (2015): We sketch the matrix $\mathbf{M}$ in (3) without forming it and compute a matrix ID of this sketch. The approximation $\hat{\bm{\mathscr{X}}}$ is then constructed using the rank-1 components of $\bm{\mathscr{X}}$ corresponding to the columns of $\mathbf{M}$ used in the ID of that matrix. The s-values $\hat{\lambda}_{1},\ldots,\hat{\lambda}_{K}$ used in the representation of $\hat{\bm{\mathscr{X}}}$ are then computed as $\hat{\lambda}_{k}=\lambda_{j_{k}}\sum_{r=1}^{R}p_{kr}$ , for $k\in[K]$ . The sketch we use is the efficient TensorSketch variant of CountSketch. Algorithm 3 outlines our proposed method for tensor ID. Proposition 2 provides guarantees for the method. A proof is provided in Section 7.3. To the best of our knowledge, there are no previous results like Proposition 2 for randomized tensor ID.
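The formula for the new s-values is simple enough to state as code. The sketch below assumes 0-based indices and a dense $K\times R$ coefficient matrix stored as a list of rows; the name `id_s_values` is hypothetical.

```python
def id_s_values(lam, j, P):
    """Compute hat_lambda_k = lam[j[k]] * sum_r P[k][r] for each k.

    lam: the R original s-values.
    j:   the K selected column indices (0-based).
    P:   the K x R coefficient matrix from the matrix ID, as a list of rows.
    """
    return [lam[jk] * sum(row) for jk, row in zip(j, P)]
```

For example, with `lam = [2.0, 3.0, 5.0]`, `j = [2, 0]` and `P = [[0.0, 0.5, 1.0], [1.0, 0.5, 0.0]]`, both row sums equal 1.5, so the result is `[7.5, 3.0]`.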

Proposition 2 (TensorSketch tensor ID)

Suppose $I_{1},\ldots,I_{N}$ , $R$ and $K<R$ are defined as in Algorithm 3. Let $\beta>1$ be a real number and $L$ a positive integer such that $2(2+3^{N})\beta K^{2}\leq L<\tilde{I}$ . Suppose that the matrix ID on line 6 of Algorithm 3 utilizes SRRQR. Then, the output of Algorithm 3 satisfies

[TABLE]

with probability at least $1-\frac{1}{\beta}$ .
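To get a feeling for how conservative the condition on $L$ in Proposition 2 is, one can simply evaluate its left-hand side. The helper below is a hypothetical illustration; the value $\beta=2$ used in the example is our own choice and does not come from the paper.

```python
import math

def min_sketch_dim_tensor(N, K, beta):
    """Smallest integer L satisfying 2 * (2 + 3**N) * beta * K**2 <= L."""
    return math.ceil(2 * (2 + 3 ** N) * beta * K ** 2)
```

For a 5-way tensor with target rank $K=1000$ and $\beta=2$, this gives $L\geq 980{,}000{,}000$, whereas the rule of thumb $L=K+10$ used in the experiments gives $L=1010$. The proposition additionally requires $L<\tilde{I}$, which this helper does not check.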

As mentioned in Section 1.2.2, an issue with forming and then decomposing $\mathbf{M}^{\top}\mathbf{M}$ is that it can be ill-conditioned. Biagioni et al. (2015) point out that the sketched matrix $\bm{\Omega}\mathbf{M}$ typically is much better conditioned since $\kappa(\bm{\Omega}\mathbf{M})\leq\kappa(\bm{\Omega})\kappa(\mathbf{M})$ and Gaussian matrices are well-conditioned. As Proposition 3 demonstrates, the matrix $\mathbf{T}\mathbf{M}$ is also well-conditioned with high probability, when the sketch dimension $L$ is sufficiently large. A proof of Proposition 3 is provided in Section 7.4.

Proposition 3

Let $\beta>1$ be a real number, and let $L,R$ and $I_{1},\ldots,I_{N}$ be positive integers such that $2(2+3^{N})\beta R^{2}\leq L$ . Suppose $\mathbf{T}\in\mathbb{R}^{L\times\tilde{I}}$ is a TensorSketch matrix, and $\mathbf{M}\in\mathbb{R}^{\tilde{I}\times R}$ is an arbitrary matrix. Then ${\kappa(\mathbf{T}\mathbf{M})\leq 7\kappa(\mathbf{M})}$ with probability at least $1-\frac{1}{\beta}$ .

5 Complexity Analysis

In this section, we compare the complexity of our proposed methods with the other algorithms. We assume all QR factorizations are done using column pivoted QR instead of SRRQR, and ignore the cost of generating random variables. We also assume that $L=K+C$ where $C$ is a small positive integer (e.g. $L=K+10$ ) since this choice works well in practice. Since $K<R$ , and we assume $L=K+C$ for a small constant $C$ , we also make the assumption $L<R$ .

The costs of the different steps of Algorithm 2 are as follows:

•

Computing the sketch $\mathbf{Y}=\mathbf{S}\mathbf{A}$ : $O(\textnormal{nnz}(\mathbf{A}))$ .

•

Computing Matrix ID of $\mathbf{Y}\in\mathbb{R}^{L\times R}$ , where $L<R$ : $O(L^{2}R)$ .

The total cost is therefore $O(\textnormal{nnz}(\mathbf{A})+K^{2}R)$. The cost of standard matrix ID can be found in Remark 3 of Cheng et al. (2005), and the cost of SRFT matrix ID can be found in Remark 5.4 of Woolfe et al. (2008). The cost of Gaussian matrix ID is straightforward to compute similarly to our computation above. Table 1 summarizes these matrix ID complexities.
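The $O(\textnormal{nnz}(\mathbf{A}))$ cost of the sketching step comes from the fact that each stored nonzero of $\mathbf{A}$ is touched exactly once. The following is a minimal sketch, assuming $\mathbf{A}$ is given as coordinate triples with 0-based indices and that $\mathbf{S}=\bm{\Phi}\mathbf{D}$ is encoded by a bucket list `h` and a sign list `d`; the function name and data layout are our own assumptions, not the paper's implementation.

```python
def countsketch_apply(triples, h, d, L, R):
    """Compute Y = S A for a sparse A given as (i, r, value) triples.

    h[i] in {0, ..., L-1} is the bucket of row i and d[i] in {-1, +1} its
    sign, so S has the single nonzero entry d[i] in position (h[i], i).
    The loop visits each stored nonzero of A exactly once.
    """
    Y = [[0.0] * R for _ in range(L)]
    for i, r, value in triples:
        Y[h[i]][r] += d[i] * value
    return Y
```

Allocating the $L\times R$ output costs $O(LR)$, which is dominated by the $O(L^{2}R)$ cost of the subsequent matrix ID.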

For the tensor ID algorithms, we assume the input is an $N$ -way rank- $R$ CP tensor of size $I\times\cdots\times I$ , and that each factor matrix has the same number of nonzeros, which we denote by $\textnormal{nnz}(\mathbf{A})$ . The costs of the different steps of Algorithm 3 are as follows:

•

Computing the TensorSketched matrix $\mathbf{Y}$ : $O(N(\textnormal{nnz}(\mathbf{A})+RL\log L))$ .

•

Computing Matrix ID of $\mathbf{Y}\in\mathbb{R}^{L\times R}$ , where $L<R$ : $O(L^{2}R)$ .

•

Computing $\hat{\lambda}_{1},\ldots,\hat{\lambda}_{K}$ : $O(RK)$ .

The total cost is therefore $O(N(\textnormal{nnz}(\mathbf{A})+RK\log K)+K^{2}R)$. Although Biagioni et al. (2015) do not specify these, the complexities for the Gram matrix approach and Gaussian tensor ID can be computed from the descriptions in their paper. Table 2 summarizes the complexities for the different tensor ID algorithms. The constant $C_{\text{mult}}$ is the cost of computing one Gram matrix $\mathbf{A}^{(n)\top}\mathbf{A}^{(n)}$ in (4), which we assume is the same for each $n$, e.g. $C_{\text{mult}}=IR^{2}$ if the factor matrices were dense.
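The reason the TensorSketched matrix can be computed without forming $\mathbf{M}$ is that the TensorSketch of a Kronecker (rank-1) structured vector equals the circular convolution of the CountSketches of its factors. The sketch below checks this for two factors. It assumes the standard TensorSketch construction with 0-based buckets, combined hash $(h_{1}(i)+h_{2}(j)) \bmod L$ and combined sign $s_{1}(i)s_{2}(j)$; Section 1.3 gives the paper's own definition. All function names are hypothetical, and the convolution is written naively for clarity.

```python
def countsketch_vec(a, h, s, L):
    """CountSketch of a dense vector a with buckets h and signs s."""
    y = [0.0] * L
    for i, value in enumerate(a):
        y[h[i]] += s[i] * value
    return y

def circular_convolve(x, y):
    """Naive O(L^2) circular convolution; an FFT would give O(L log L)."""
    L = len(x)
    z = [0.0] * L
    for p in range(L):
        for q in range(L):
            z[(p + q) % L] += x[p] * y[q]
    return z

def tensorsketch_explicit(a, b, h1, s1, h2, s2, L):
    """Sketch the product structure of a and b by visiting every entry."""
    y = [0.0] * L
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            y[(h1[i] + h2[j]) % L] += s1[i] * s2[j] * ai * bj
    return y
```

Replacing `circular_convolve` by an FFT-based product is what gives the $RL\log L$ term in the cost above, since one length-$L$ transform is needed per factor matrix column.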

6 Numerical Experiments

The numerical experiments are done in Matlab R2018b and C. All results are averages over ten runs in an environment using four cores of an Intel Xeon E5-2680 v3 @2.50GHz CPU and 19 GB of RAM. All code used to generate our results can be found at https://github.com/OsmanMalik/countsketch-matrix-tensor-id, including implementations of our proposed methods. For all randomized methods, we use an oversampling parameter equal to 10 (i.e., $L=K+10$ in Algorithms 2 and 3).

6.1 Matrix ID Experiments

We compare the four methods in Table 1. For standard matrix ID, we use the implementation in RSVDPACK (available at https://github.com/sergeyvoronin/LowRankMatrixDecompositionCodes). For the remaining methods, we use our own Matlab implementations which utilize Matlab’s column pivoted QR function. RSVDPACK only supports dense matrices. Moreover, since it is challenging to efficiently construct partial QR decompositions of sparse matrices, we did not attempt to write our own implementation of standard matrix ID for sparse matrices; see Section 11.1.8 of Golub and Van Loan (2013) for a discussion about the challenges of sparse QR. We therefore have to convert each sparse input matrix to dense format before applying standard matrix ID from RSVDPACK. Similarly, it is challenging to implement an efficient algorithm for FFT for sparse matrices, or for the accelerated FFT by Woolfe et al. (2008). We therefore also have to convert the input matrix to dense format before applying standard FFT in our implementation of SRFT matrix ID. However, by only sketching a subset of columns of the input matrix at a time, we can avoid having to convert all columns of the matrix to dense format at the same time. In the experiments, we use the modification described in Remark 2 when implementing our proposed CountSketch matrix ID.

Computing the spectral norm of the matrices we consider is not feasible due to their size. Therefore, when computing the error for each matrix decomposition, we utilize the randomized scheme for estimating the spectral norm suggested in Section 3.4 of Woolfe et al. (2008). Letting $E$ be the true error in spectral norm, our estimates $\tilde{E}$ satisfy the following properties: $\tilde{E}\leq E$, and $\mathbb{P}(\tilde{E}\geq E/100)=1-q$ where $0<q\ll 1$. In other words, the estimate never exceeds the true spectral norm, and it is unlikely to be much smaller (with “much smaller” meaning more than two orders of magnitude smaller). This is good enough for our purposes, since we are primarily interested in comparing the performance of the different methods rather than establishing the exact errors. In the first experiment, $q<\text{2e−2}$, and in the second $q<\text{2e−5}$.
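To illustrate why such an estimate can never exceed the true norm, here is a generic randomized power iteration. It is only in the spirit of the scheme referenced above and is not claimed to be the exact procedure of Woolfe et al. (2008). The function name, the default iteration count and the callback interface are our own assumptions.

```python
import math
import random

def estimate_spectral_norm(matvec, rmatvec, n, iters=20, seed=None):
    """Lower-bound estimate of ||E|| from power iteration on E^T E.

    matvec(x) returns E x and rmatvec(y) returns E^T y, so E itself is never
    formed. n is the number of columns of E.
    """
    rng = random.Random(seed)
    x = [rng.gauss(0.0, 1.0) for _ in range(n)]
    for _ in range(iters):
        norm_x = math.sqrt(sum(v * v for v in x))
        if norm_x == 0.0:
            return 0.0
        x = [v / norm_x for v in x]
        x = rmatvec(matvec(x))
    # Here x = E^T E u for a unit vector u, and ||E^T E u|| <= ||E||^2.
    return math.sqrt(math.sqrt(sum(v * v for v in x)))
```

In the setting of this section, `matvec` and `rmatvec` would apply the residual of the decomposition and its transpose; the random start is what makes a large underestimate unlikely.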

6.1.1 Experiment 1: Synthetic Matrices

We generate sparse matrices $\mathbf{A}\in\mathbb{R}^{I\times R}$ with $R=\text{1e+4}$ , and density $\textnormal{nnz}(\mathbf{A})/(IR)\approx 0.5\%$ . We use different values of $I\in[\text{1e+4},\text{1e+6}]$ . The matrices have a true rank of $2K$ , where $K=\text{1e+3}$ . Similarly to experiments by Martinsson et al. (2011), we let $\sigma_{i}(\mathbf{A})$ , $i\in[K]$ , decay exponentially to $10^{-8}$ , and then remain constant at $\sigma_{i}(\mathbf{A})\approx 10^{-8}$ , $i\in[K+1:2K]$ .
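The singular value profile described above can be written down explicitly. The sketch below assumes that the decay starts at $\sigma_{1}=1$ and is geometric in the index; the text only states that the values decay exponentially to $10^{-8}$, so the starting value and the function name are our assumptions.

```python
def synthetic_singular_values(K, floor_exp=-8.0):
    """Return 2K values: geometric decay from 1 to 10**floor_exp, then flat."""
    if K < 2:
        raise ValueError("need K >= 2")
    decaying = [10.0 ** (floor_exp * i / (K - 1)) for i in range(K)]
    flat = [10.0 ** floor_exp] * K
    return decaying + flat
```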

The results for the first experiment are presented in Figure 1. Standard matrix ID encountered memory issues when $I\geq\text{5e+4}$ . For the matrix sizes the standard method could handle, it was more accurate but much slower than the randomized methods. The accuracy of all randomized methods is comparable. Our proposed CountSketch matrix ID is the fastest, achieving a speed-up of about $18\times$ and $12\times$ when $I=\text{1e+6}$ compared to Gaussian and SRFT ID, respectively.

6.1.2 Experiment 2: Real-World Matrices

We decompose a sparse matrix which comes from a computer vision problem and is part of the SuiteSparse Matrix Collection (the matrix can be downloaded from https://sparse.tamu.edu/Brogan/specular). The matrix is of size 477,976 by 1,600, contains 7,647,040 nonzero elements, and has a rank of 1,442. We set the target rank to $K=\text{1,442}$. Ideally, the methods should be able to produce decompositions with a very small error. We only attempt this with the three randomized methods, since the matrix is too large for standard matrix ID. Table 3 shows the result. All methods produce good approximations with a small error. Our proposed CountSketch matrix ID method is much faster than the other algorithms, achieving a speed-up of about $35\times$ and $31\times$ compared to Gaussian and SRFT matrix ID, respectively.

To further support our claim that $L=K+10$ works well in practice, we have done additional experiments. We consider 20 matrices from the SuiteSparse Matrix Collection (landmark, Franz7, ch7-8-b2, ch7-9-b2, ch8-8-b2, mk12-b2, shar_te2-b1, rel7, relat7b, relat7, abtaha2, abtaha1, specular, photogrammetry2, GL7d12, ch7-6-b2, ch7-7-b2, cis-n4c6-b3, mk11-b2, n4c6-b3), and 3 different target ranks (10%, 50% and 90% of the number of columns). The matrices are of different sizes and come from different application areas. We compare the performance of Gaussian, SRFT and CountSketch (proposed method) matrix ID, all with $L=K+10$, repeating each experiment 10 times and reporting averages. The results are in Table 4; on average, our method is the most accurate even compared to the Gaussian method which it outperforms in 34 of the 60 tests. Our method outperforms the SRFT method in 57 of the 60 tests. Out of the 26 cases when the Gaussian method outperforms our method, the difference is no more than 7% in 25 of those cases, and 31% in one case. Out of the 58 cases when the Gaussian method outperforms the SRFT method, the difference is no more than 14% in 56 of those cases, and 36%–45% in two cases.

As mentioned in Remark 1, semi-coherent matrices are adversarial to CountSketch and our proposed method. The matrix soc-sign-bitcoin-otc in the SuiteSparse Matrix Collection, which is the adjacency matrix of a graph, is a concrete example of when choosing $L=K+10$ results in a large error for our method. The semi-coherent structure of this matrix can be revealed by rearranging it so that rows and columns corresponding to nodes in the same strongly connected components of the graph are adjacent in the matrix. Some care is therefore necessary when applying our method together with the rule of thumb $L=K+10$ .

6.2 Tensor ID Experiments

We compare the three methods in Table 2. We have implemented all methods ourselves in Matlab and C.

6.2.1 Experiment 1: Synthetic Tensors

We generate sparse $5$ -way tensors $\bm{\mathscr{X}}\in\mathbb{R}^{I\times\cdots\times I}$ using (1), where each factor matrix column $\mathbf{a}_{:r}^{(n)}$ is a random sparse vector with a density of 1%, and we use different values of $I\in[\text{1e+3},\text{1e+5}]$ . The number of rank-1 terms is $R=\text{10,000}$ , and we use a target rank of 1,000. The values of $\lambda_{r}$ in (1) are defined as $\lambda_{r}\stackrel{{\scriptstyle\text{\tiny{def}}}}{{=}}10^{-\frac{r-1}{R}8}$ for $r\in[1000]$ , and $\lambda_{r}\stackrel{{\scriptstyle\text{\tiny{def}}}}{{=}}10^{-8}$ for $r\in[1001:R]$ . The results for the experiment are presented in Figure 2. Gaussian and CountSketch tensor ID achieve similar accuracy. Although the Gram matrix approach has a better accuracy here, it can have issues reaching an error below the square root of machine precision due to poor conditioning; see the example in Section 5.1.1 of Biagioni et al. (2015). Our proposed method is much faster than both other methods for the larger tensors, achieving a speed-up of $46\times$ over the Gram matrix approach (for $I=\text{2.5e+4}$ ) and $14\times$ over Gaussian tensor ID (for $I=\text{1e+5}$ ).

6.2.2 Experiment 2: Real-World Tensor

The purpose of this experiment is to show how tensor ID can be useful in a data analysis task. We implement Algorithm 2 by Reynolds et al. (2017), which requires repeated rank reduction, and use it to find the maximum magnitude element in a CP tensor which comes from decomposing streamed data. The rank reduction step is done using tensor ID. The data we consider is a decomposed version of the Enron data set (available at http://frostt.io/tensors/enron) of size $\text{6,066}\times\text{5,699}\times\text{244,268}\times\text{1,176}$. The Enron data set keeps track of email correspondence between employees at Enron, and the four modes represent sender, receiver, keyword and date. The decomposition has rank 100, and was constructed using the streamed version of SPLATT (Smith et al., 2018; available at https://github.com/ShadenSmith/splatt-stream), with the data streamed along the fourth mode (time). As suggested in the documentation of SPLATT-stream, we apply an additional Frobenius norm regularizer with regularization coefficient 1e−2 to the mode-4 factor matrix. We threshold the factor matrices output by SPLATT-stream by first normalizing them so that each column has unit 2-norm (the normalization constant is absorbed into the s-values) and then setting all elements with magnitude less than 1e−6 to zero. The relative error introduced by this thresholding is less than 2e−5.

Unlike the previous experiments, the matrices being sketched in this experiment have many rows containing only zeros. We could therefore speed up Gaussian tensor ID by only generating those columns of the Gaussian sketch matrices which are actually multiplied by nonzero elements. We used this improved version of Gaussian tensor ID in the experiment for a fairer comparison. The same modification does not yield a speed-up of Gaussian matrix or tensor ID in the previous experiments, since most rows of the matrices sketched there contain nonzero elements.
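The zero-row shortcut for the Gaussian sketch can be illustrated as follows. This is a minimal sketch under our own assumptions: the matrix is passed as a dictionary of its nonzero rows, and the function name and return values are hypothetical rather than taken from the authors' implementation.

```python
import random

def gaussian_sketch_nonzero_rows(rows, L, R, seed=None):
    """Compute Y = Omega A, drawing only the columns of Omega that are needed.

    rows maps a row index i to the dense list A[i, :] and contains only the
    nonzero rows of A. Column i of the Gaussian matrix Omega multiplies row i
    of A, so columns belonging to all-zero rows are never generated.
    """
    rng = random.Random(seed)
    Y = [[0.0] * R for _ in range(L)]
    omega_cols = {}
    for i, a_row in rows.items():
        col = [rng.gauss(0.0, 1.0) for _ in range(L)]
        omega_cols[i] = col
        for l in range(L):
            for r in range(R):
                Y[l][r] += col[l] * a_row[r]
    return Y, omega_cols
```

The number of Gaussian samples drawn is $L$ times the number of nonzero rows, rather than $L$ times the total number of rows.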

Finding the maximum magnitude element using a brute force approach would require computing every nonzero element in the tensor, which would be costly. Using the algorithm by Reynolds et al. (2017) together with our CountSketch tensor ID, we find the maximum in 11 seconds. The sketching portion of the algorithm takes $2.6\times$ more time if Gaussian tensor ID is used instead. We do not compare with the Gram matrix approach since it takes very long to run. With the results in the previous subsection in mind, we believe the speed-up would be more substantial for higher rank tensors. For all ten trials, and both when using CountSketch and Gaussian tensor ID for rank reduction, the same position for the maximum magnitude element is identified each time.

7 Proofs

7.1 Proof of Proposition 1

Our proof of Proposition 1 is an adaption of the proof for SRFT matrix ID provided by Woolfe et al. (2008). We show that their arguments hold when a CountSketch matrix is used for sketching instead of an SRFT matrix. Although much of our proof is identical to that provided by Woolfe et al. (2008), we choose to include it in detail. The reason for doing this is that the proofs of Propositions 2 and 4 rely on adapting the proof in the present section. Having a detailed proof here therefore makes those subsequent proofs easier to follow.

The following facts will be useful in the proof.

Fact 2 (Lemma 3.7 in Martinsson et al. (2011))

Let $I$ and $R$ be positive integers with $I\geq R$ . Suppose $\mathbf{A}\in\mathbb{R}^{I\times R}$ is a matrix such that $\mathbf{A}^{\top}\mathbf{A}$ is invertible. Then

[TABLE]

Fact 3 (Lemma 3.7 in Woolfe et al. (2008))

Let $K$ , $L$ , $I$ and $R$ be positive integers such that $K\leq R$ . Suppose $\mathbf{A}\in\mathbb{R}^{I\times R}$ , $\mathbf{B}\in\mathbb{R}^{I\times K}$ is a matrix whose columns constitute a subset of the columns of $\mathbf{A}$ , $\mathbf{P}\in\mathbb{R}^{K\times R}$ , $\mathbf{X}\in\mathbb{R}^{I\times L}$ , and $\mathbf{S}\in\mathbb{R}^{L\times I}$ . Then

[TABLE]

Fact 4 (Lemma 3.9 in Martinsson et al. (2011))

Let $L$ , $I$ and $R$ be positive integers. Suppose $\mathbf{A}\in\mathbb{R}^{I\times R}$ , and $\mathbf{S}\in\mathbb{R}^{L\times I}$ . Then $\sigma_{j}(\mathbf{S}\mathbf{A})\leq\|\mathbf{S}\|\sigma_{j}(\mathbf{A})$ for all $j\in[\min(L,I,R)]$ .

Fact 5 is a special case of a more general statement in Atkinson and Han (2009).

Fact 5 (Theorem 2.3.1 of Atkinson and Han (2009))

Let $\mathbf{L}\in\mathbb{R}^{K\times K}$ be a matrix and assume $\|\mathbf{L}\|<1$ . Then $(\mathbf{I}^{(K)}-\mathbf{L})$ is invertible and

[TABLE]

Lemma 1 is an adaption of Lemma 4.2 by Woolfe et al. (2008).

Lemma 1

Let $K$ , $L$ and $I$ be positive integers such that $K\leq I$ . Suppose $\mathbf{S}=\bm{\Phi}\mathbf{D}\in\mathbb{R}^{L\times I}$ is a CountSketch matrix, and $\mathbf{U}\in\mathbb{R}^{I\times K}$ is a matrix with orthonormal columns. Define $\mathbf{C}\in\mathbb{R}^{K\times K}$ as

[TABLE]

and define $\mathbf{E}\in\mathbb{R}^{K\times K}$ elementwise as

[TABLE]

Then $\mathbf{C}=\mathbf{I}^{(K)}+\mathbf{E}$ .

Proof

For $k,k^{\prime}\in[K]$ ,

[TABLE]

Since

[TABLE]

we can rewrite (17) as

[TABLE]

The second term on the last line in the equation above is just $e_{kk^{\prime}}$ . Since

[TABLE]

and $d_{ii}^{2}=1$ , the first term is just

[TABLE]

It follows that $\mathbf{C}=\mathbf{I}^{(K)}+\mathbf{E}$ . ∎
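Lemma 1 can be checked numerically on a small example. The sketch below assumes that $\mathbf{E}$ in (16) collects exactly the cross terms with $i\neq i^{\prime}$ that are hashed to the same bucket, which is our reading of the proof; the displayed definition should be consulted for the precise form. Indices are 0-based and all function names are hypothetical.

```python
def gram_of_sketch(U, h, d, L):
    """Return C = (S U)^T (S U) for the CountSketch S defined by h and d."""
    K = len(U[0])
    SU = [[0.0] * K for _ in range(L)]
    for i, row in enumerate(U):
        for k in range(K):
            SU[h[i]][k] += d[i] * row[k]
    return [[sum(SU[l][k] * SU[l][kp] for l in range(L)) for kp in range(K)]
            for k in range(K)]

def collision_matrix(U, h, d):
    """Assumed form of E: sum over pairs i != i' that share a bucket."""
    I, K = len(U), len(U[0])
    E = [[0.0] * K for _ in range(K)]
    for i in range(I):
        for ip in range(I):
            if i != ip and h[i] == h[ip]:
                for k in range(K):
                    for kp in range(K):
                        E[k][kp] += d[i] * d[ip] * U[i][k] * U[ip][kp]
    return E
```

For any matrix `U` with orthonormal columns, the entries of `gram_of_sketch` should equal the identity plus `collision_matrix` up to rounding.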

Lemma 2 is an adaption of Lemma 4.3 by Woolfe et al. (2008).

Lemma 2

Let $\alpha$ and $\beta$ be real numbers such that $\alpha,\beta>1$ , and let $K$ , $L$ and $I$ be positive integers such that

[TABLE]

Suppose $\mathbf{S}=\bm{\Phi}\mathbf{D}\in\mathbb{R}^{L\times I}$ is a CountSketch matrix, $\mathbf{U}\in\mathbb{R}^{I\times K}$ is a matrix with orthonormal columns, and $\mathbf{E}\in\mathbb{R}^{K\times K}$ is the matrix defined in (16). Then

[TABLE]

with probability at least $1-\frac{1}{\beta}$ .

Proof

Using the definition in (16), we have

[TABLE]

Note that for each term in the sum above, $i\neq i^{\prime}$ and $j\neq j^{\prime}$ . This means that unless ( $i=j$ and $i^{\prime}=j^{\prime}$ ) or ( $i=j^{\prime}$ and $i^{\prime}=j$ ), we have

[TABLE]

since each $d_{ii}$ is independent from all other random variables, and since $\mathbb{E}[d_{ii}]=0$ for all $i\in[I]$ . We can therefore rewrite (24) as

[TABLE]

The matrix $\bm{\Phi}$ has exactly one nonzero entry which is equal to 1 in each column. Consequently,

[TABLE]

The event $h(i)=h(i^{\prime})$ happens with probability $\frac{1}{L}$ when $i\neq i^{\prime}$. It follows that

[TABLE]

Using this fact, and the fact that each $d_{ii}^{2}=1$ , (26) simplifies to

[TABLE]

Note that

[TABLE]

Moreover,

[TABLE]

Combining (29), (30) and (31) yields

[TABLE]

Since

[TABLE]

we have

[TABLE]

Using Markov’s inequality and the condition in (22), we have

[TABLE]

Consequently,

[TABLE]

∎

Lemma 3 is an adaption of Lemma 4.4 by Woolfe et al. (2008).

Lemma 3

Let $\alpha$ , $\beta$ , $K$ , $L$ and $I$ satisfy the same properties as in Lemma 2. Furthermore, suppose $\mathbf{S}$ , $\mathbf{U}$ , and $\mathbf{E}$ are defined as in Lemma 2, and let $\mathbf{C}\stackrel{{\scriptstyle\text{\tiny{def}}}}{{=}}(\mathbf{S}\mathbf{U})^{\top}(\mathbf{S}\mathbf{U})$ . If (23) is true, then the following hold:

[TABLE]

$\mathbf{C}$ is invertible, and

[TABLE]

Proof

Using Lemma 1 and (23), we then have

[TABLE]

Since $\mathbf{C}=\mathbf{I}^{(K)}+\mathbf{E}$ and $\|\mathbf{E}\|<1$ , it follows from Fact 5 that $\mathbf{C}$ is invertible and

[TABLE]

where the last inequality follows from (23). Consequently,

[TABLE]

∎

Lemma 4 is an adaption of Lemma 4.5 by Woolfe et al. (2008).

Lemma 4

Let $L$ and $I$ be positive integers with $L<I$ . Suppose $\mathbf{S}\in\mathbb{R}^{L\times I}$ is a CountSketch matrix. Then $\|\mathbf{S}\|\leq\sqrt{I}$ .

Proof

The matrix $\mathbf{S}$ contains $I$ nonzero elements, all of magnitude 1. It follows that $\|\mathbf{S}\|_{\textup{F}}^{2}=I$ , and hence $\|\mathbf{S}\|\leq\|\mathbf{S}\|_{\textup{F}}\leq\sqrt{I}$ . ∎
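The bound in Lemma 4 is easy to examine for a concrete hash map. The sketch below also computes the exact spectral norm, using the observation (ours, not stated in the paper) that the rows of $\mathbf{S}$ have disjoint supports, so $\mathbf{S}\mathbf{S}^{\top}$ is diagonal with the bucket sizes on its diagonal. The function name is hypothetical and buckets are 0-based.

```python
from collections import Counter
import math

def countsketch_norms(h):
    """Return (Frobenius norm, spectral norm) of the CountSketch S given by h.

    S has one entry of magnitude 1 per column, so ||S||_F^2 = I = len(h).
    Since S S^T is diagonal with the bucket sizes on the diagonal, ||S|| is
    the square root of the largest bucket size.
    """
    bucket_sizes = Counter(h)
    frob = math.sqrt(len(h))
    spec = math.sqrt(max(bucket_sizes.values()))
    return frob, spec
```

When the indices are spread over many buckets, the spectral norm is therefore typically much smaller than $\sqrt{I}$, so the bound used in the proof is safe but loose.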

Lemma 5 is an adaption of Lemma 4.6 by Woolfe et al. (2008).

Lemma 5

Let $\alpha$ , $\beta$ , $K$ , $L$ and $I$ satisfy the same properties as in Lemma 2. Suppose $\mathbf{S}\in\mathbb{R}^{L\times I}$ is a CountSketch matrix, and $\mathbf{A}\in\mathbb{R}^{I\times R}$ is an arbitrary matrix. Then, with probability at least $1-\frac{1}{\beta}$ , there exists a matrix $\mathbf{X}\in\mathbb{R}^{I\times L}$ such that

[TABLE]

and

[TABLE]

Proof

Let $\mathbf{A}=\mathbf{U}\bm{\Sigma}\mathbf{V}^{\top}$ be the SVD of $\mathbf{A}$ , where $\mathbf{U}\in\mathbb{R}^{I\times I}$ and $\mathbf{V}\in\mathbb{R}^{R\times R}$ are unitary, and $\bm{\Sigma}\in\mathbb{R}^{I\times R}$ is diagonal with non-negative entries. Split $\mathbf{U}$ into two matrices $\mathbf{U}^{(1)}\in\mathbb{R}^{I\times K}$ and $\mathbf{U}^{(2)}\in\mathbb{R}^{I\times(I-K)}$ so that $\mathbf{U}=\begin{bmatrix}\mathbf{U}^{(1)}&\mathbf{U}^{(2)}\end{bmatrix}$ . Let $\mathbf{Z}^{(1)}=\mathbf{S}\mathbf{U}^{(1)}\in\mathbb{R}^{L\times K}$ and $\mathbf{Z}^{(2)}=\mathbf{S}\mathbf{U}^{(2)}\in\mathbb{R}^{L\times(I-K)}$ . Then

[TABLE]

Define $\mathbf{C}=\mathbf{Z}^{(1)\top}\mathbf{Z}^{(1)}\in\mathbb{R}^{K\times K}$ and let $\mathbf{E}$ be the corresponding matrix defined in (16), but in terms of $\mathbf{U}^{(1)}$ instead of $\mathbf{U}$ . Then $\mathbf{C}=\mathbf{I}^{(K)}+\mathbf{E}$ according to Lemma 1. For the remainder of the proof, we will assume that $\|\mathbf{E}\|\leq 1-\frac{1}{\alpha}$ , which happens with probability at least $1-\frac{1}{\beta}$ according to Lemma 2. Then $\mathbf{C}$ is invertible according to Lemma 3. Define $\mathbf{G}^{(-1)}\stackrel{{\scriptstyle\text{\tiny{def}}}}{{=}}\mathbf{C}^{-1}\mathbf{Z}^{(1)\top}=(\mathbf{Z}^{(1)\top}\mathbf{Z}^{(1)})^{-1}\mathbf{Z}^{(1)\top}\in\mathbb{R}^{K\times L}$ and

[TABLE]

According to Fact 2 and Lemma 3, it follows that

[TABLE]

Combining (45) and (46), we have

[TABLE]

So (43) is satisfied. Next, let $\bm{\Theta}\in\mathbb{R}^{K\times K}$ and $\bm{\Psi}\in\mathbb{R}^{(I-K)\times(I-K)}$ be the matrices in the upper left and lower right corners of $\bm{\Sigma}$ , respectively, so that

[TABLE]

It is easy to verify that

[TABLE]

Using (48), we can further rewrite

[TABLE]

Note that

[TABLE]

From (48), we know that

[TABLE]

Moreover, using (44), the fact that $\mathbf{U}$ is unitary, and Lemma 4, we have

[TABLE]

Combining (49), (50), (51), (52), (53) and (46) we have

[TABLE]

which proves (42). ∎

We can now prove Proposition 1 in the main manuscript. The proof is an adaption of the discussion in Section 5.1 of Woolfe et al. (2008).

Proof (Proposition 1)

According to Fact 1, the outputs $\mathbf{P}$ and $\mathbf{j}$ computed on line 5 of Algorithm 2 satisfy the following: $\mathbf{P}$ satisfies properties (i)–(iv) in Fact 1, including

[TABLE]

and

[TABLE]

since $K\leq\min(L,R)$ . Applying Fact 3, we have

[TABLE]

where $\mathbf{X}\in\mathbb{R}^{I\times L}$ is an arbitrary matrix. From Lemma 5, with probability at least $1-\frac{1}{\beta}$, we can choose $\mathbf{X}$ such that the bounds in (42) and (43) hold. Moreover, since $\mathbf{Y}=\mathbf{S}\mathbf{A}$, it follows that $\mathbf{Y}_{:\mathbf{j}}=\mathbf{S}\mathbf{A}_{:\mathbf{j}}$, and consequently,

[TABLE]

Combining (42), (43), (55), (56), (57), (58), and Fact 4 gives that

[TABLE]

with probability at least $1-\frac{1}{\beta}$ . Setting $\alpha=4$ then yields the same bounds as in the statement in Proposition 1. ∎

7.2 Formal Statement and Proof of Claim in Remark 2

We express the statement in Remark 2 in slightly different terms here. Let ${f:[I]\rightarrow[L]}$ be a hybrid deterministic/random function defined as

[TABLE]

where all $x_{i}$ are iid random variables that are uniformly distributed in $[L]$ . Furthermore, let $\pi:[I]\rightarrow[I]$ be a uniform random permutation function. We then define $\tilde{h}:[I]\rightarrow[L]$ as $\tilde{h}(i)\stackrel{{\scriptstyle\text{\tiny{def}}}}{{=}}f(\pi(i))$ . Using $\tilde{h}$ instead of $h$ in the definition of CountSketch ensures that $\mathbf{S}$ is of full rank. The guarantees of Proposition 1 still hold for this modified CountSketch, and in fact the bound in (9) is slightly improved.

Proposition 4

If $\tilde{h}$ defined in this way is used instead of $h$ when defining $\mathbf{S}$ on line 3 in Algorithm 2, then Proposition 1 still holds, but with the condition in (9) improved to

[TABLE]

We have not seen anyone else consider this kind of modified CountSketch.

Proof (Proposition 4)

When using the modified CountSketch matrix proposed in Remark 2, the only thing that will change in the proof in Section 7.1 is Lemma 2. Notice that going from $h$ to $\tilde{h}$ only impacts $\bm{\Phi}$ and not $\mathbf{D}$ , and since $\bm{\Phi}$ and $\mathbf{D}$ remain independent, the argument that takes us from (24) to (26) remains valid for $\tilde{h}$ . Indeed, the key conditions that each $d_{ii}$ is independent from all other random variables and $\mathbb{E}[d_{ii}]=0$ remain true when we use $\tilde{h}$ instead of $h$ . However, the expectation in (28) will change, which impacts (29), due to the fact that the probability of the event $\tilde{h}(i)=\tilde{h}(i^{\prime})$ when $i\neq i^{\prime}$ is not $\frac{1}{L}$ . Note that

[TABLE]

We can rewrite

[TABLE]

Notice that the first term on the right hand side of (63) is zero, since $f$ then will map $\pi(i)$ and $\pi(i^{\prime})$ to distinct elements. The second and third term in (63) are equal. Considering the second term, we have

[TABLE]

since if $\pi(i^{\prime})\in[L]$ , then $f(\pi(i^{\prime}))=l$ if and only if $\pi(i^{\prime})=l$ . Furthermore,

[TABLE]

where the second equality is true since each $x_{i}$ , $i\in[L+1:I]$ is iid. For the fourth term in the right hand side of (63), we have

[TABLE]

where the second equality again holds since each $x_{i}$ , $i\in[L+1:I]$ is iid and $\pi(i)\neq\pi(i^{\prime})$ . Combining (62), (63), (64), (65), (66), and using the fact that the second and third term in (63) are equal, we get

[TABLE]
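The collision probability of the modified hash can be verified by brute force for very small sizes. The sketch below enumerates every permutation $\pi$ and every choice of the random values $x_{i}$, using 0-based indices; the function name is hypothetical. Working through the four cases in (63) by hand suggests the closed form $\frac{1}{L}\bigl(1-\frac{L(L-1)}{I(I-1)}\bigr)$, which is our own simplification and should be compared against the displayed result.

```python
from fractions import Fraction
from itertools import permutations, product

def collision_probability(I, L):
    """Exact P(h~(i) = h~(i')) for a fixed pair i != i', by enumeration.

    f(p) = p for p < L and f(p) = x[p] otherwise, and h~(i) = f(pi(i)).
    Only feasible for very small I and L.
    """
    i, ip = 0, 1  # any fixed pair of distinct indices
    hits = total = 0
    for pi in permutations(range(I)):
        for xs in product(range(L), repeat=I - L):
            # f is the identity on the first L positions, random otherwise
            fi = pi[i] if pi[i] < L else xs[pi[i] - L]
            fip = pi[ip] if pi[ip] < L else xs[pi[ip] - L]
            hits += int(fi == fip)
            total += 1
    return Fraction(hits, total)
```

For $I=4$ and $L=2$ the enumeration gives $5/12$, which is smaller than $1/L=1/2$, in line with the slightly improved condition.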

Proceeding with the remainder of the proof of Lemma 2 as before, we now get a bound

[TABLE]

Using this new bound and the new condition

[TABLE]

together with Markov’s inequality, we get

[TABLE]

and consequently

[TABLE]

holds in this case too.

All the other lemmas will remain the same, with the only exception that Lemmas 3 and 5 now will use the new condition in (69) instead of the old one in (22). The proof of the proposition itself at the end of Section 7.1 will therefore remain identical. When using the modified CountSketch, the statements in Proposition 1 will therefore remain true with the new condition in (69). Setting $\alpha=4$ then yields the desired bound. ∎

7.3 Proof of Proposition 2

The following fact will be useful.

Fact 6 (Theorem B.1 in the supplement of Diao et al. (2018))

Recall that $\tilde{I}=I_{1}\cdots I_{N}$ . Let $\mathbf{T}\in\mathbb{R}^{L\times\tilde{I}}$ be a TensorSketch operator defined as in Section 1.3 in terms of $N$ CountSketch operators.

(i)

Suppose $\mathbf{A}$ and $\mathbf{B}$ are matrices with $\tilde{I}$ rows. For $L\geq(2+3^{N})/(\varepsilon^{2}\delta)$ , we have

[TABLE]

(ii)

Suppose $\mathbf{M}\in\mathbb{R}^{\tilde{I}\times R}$ is any matrix. If $L\geq R^{2}(2+3^{N})/(\varepsilon^{2}\delta)$ , then the following holds with probability at least $1-\delta$ :

[TABLE]

We break the proof into two parts. First, we prove Lemma 6 which is a variant of Proposition 1 for the case when a TensorSketch operator $\mathbf{T}\in\mathbb{R}^{L\times\tilde{I}}$ is used instead of a CountSketch operator. Then we prove the proposition itself.

Lemma 6

Let $\alpha$ and $\beta$ be real numbers such that $\alpha,\beta>1$ , and let $K$ , $L$ , $R$ and $I_{1},\ldots,I_{N}$ be positive integers such that $K\leq R$ and

[TABLE]

Suppose that the matrix ID on line 6 of Algorithm 3 utilizes SRRQR. Then the outputs $\mathbf{P}$ and $\mathbf{j}$ on that line will satisfy

[TABLE]

with probability at least $1-\frac{1}{\beta}$ .

Proof

Recall from Section 1.3 that TensorSketch is defined similarly to CountSketch, but using the hash function $H$ instead of $h$ , and using the diagonal matrix $\mathbf{D}^{(S)}$ instead of $\mathbf{D}$ . Letting $\bm{\Phi}^{(H)}\in\mathbb{R}^{L\times(I_{1}\cdots I_{N})}$ be a matrix with $\phi^{(H)}_{H(i)i}=1$ for $i\in[I_{1}\cdots I_{N}]$ , and with all other entries equal to 0, we can write $\mathbf{T}=\bm{\Phi}^{(H)}\mathbf{D}^{(S)}$ . This means that the proof in Section 7.1 largely can be repeated to prove the present lemma. Lemma 1 remains true in its present form when TensorSketch is used instead of CountSketch.

To see that Lemma 2 remains true with the new condition when $\mathbf{S}$ is replaced by $\mathbf{T}$ , let $\mathbf{T}\in\mathbb{R}^{L\times\tilde{I}}$ be a TensorSketch operator, and let $\mathbf{U}\in\mathbb{R}^{\tilde{I}\times K}$ be a matrix with orthonormal columns. Define $\mathbf{C}\stackrel{{\scriptstyle\text{\tiny{def}}}}{{=}}(\mathbf{T}\mathbf{U})^{\top}(\mathbf{T}\mathbf{U})$ , and let $\mathbf{E}$ be defined as in (16), but in terms of the corresponding quantities from TensorSketch. Then $\mathbf{E}=\mathbf{C}-\mathbf{I}^{(K)}$ , according to Lemma 1. Using Fact 6 (i), condition (74), and the fact that $\|\mathbf{U}\|_{\textup{F}}^{2}=K$ , we have

[TABLE]

All the other lemmas will remain the same when $\mathbf{T}$ is used instead of $\mathbf{S}$ , with the only exception that Lemmas 3 and 5 now will use the new condition in (74) instead of the old one in (22). Using exactly the same arguments as in the proof of Proposition 1 at the end of Section 7.1 will therefore give the bound in (75). ∎

Proof (Proposition 2)

Recall that $\bm{\mathscr{X}}$ and $\hat{\bm{\mathscr{X}}}$ are defined as in (1) and (2), respectively, and the coefficients $\hat{\lambda}_{1},\ldots,\hat{\lambda}_{K}$ are defined as

[TABLE]

for $k\in[K]$ . We then have

[TABLE]

Letting $\mathcal{I}\stackrel{{\scriptstyle\text{\tiny{def}}}}{{=}}[I_{1}]\times\cdots\times[I_{N}]$ , we have

[TABLE]

where the inequality follows from the Cauchy–Schwarz inequality. Combining (78) and (79) we get

[TABLE]

where the second inequality is a well-known relation (see e.g. equation (2.3.7) in Golub and Van Loan (2013)). Combining (80) and Lemma 6 gives that

[TABLE]

with probability at least $1-\frac{1}{\beta}$ . Setting $\alpha=4$ then yields the same bounds as in the statement in Proposition 2. ∎

7.4 Proof of Proposition 3

Proof

Note that $\mathbf{T}\mathbf{M}$ is of size $L\times R$ , with $L>R$ . So $\sigma_{R}(\mathbf{T}\mathbf{M})$ is the smallest singular value of $\mathbf{T}\mathbf{M}$ . Suppose

[TABLE]

To simplify notation, let $\varepsilon\stackrel{{\scriptstyle\text{\tiny{def}}}}{{=}}1-1/\alpha$ . Using Theorem 8.6.1 in Golub and Van Loan (2013) and Fact 6 (ii), we have that with probability at least $1-\frac{1}{\beta}$ , the following hold:

[TABLE]

and

[TABLE]

We therefore have

[TABLE]

with probability at least $1-\frac{1}{\beta}$ . Setting $\alpha=4$ gives us the bounds in Proposition 3. ∎

8 Conclusion

We have presented a new fast randomized algorithm for computing matrix ID, which utilizes CountSketch. We have then shown how this method can be extended to computing the tensor ID of CP tensors. For both the matrix and tensor settings, we provided performance guarantees. To the best of our knowledge, we provide the first performance guarantees for any randomized tensor ID algorithm. We conducted several numerical experiments on both synthetic and real data. These experiments showed that our algorithms maintain the same accuracy as other randomized methods, but with a much shorter run time, running at least an order of magnitude faster on the larger matrices and tensors.

Bibliography

Ailon and Chazelle (2009) Nir Ailon and Bernard Chazelle. The fast Johnson–Lindenstrauss transform and approximate nearest neighbors. SIAM Journal on Computing, 39(1):302–322, 2009.
Atkinson and Han (2009) Kendall Atkinson and Weimin Han. Theoretical Numerical Analysis: A Functional Analysis Framework. Number 39 in Texts in Applied Mathematics. Springer-Verlag, New York, 3rd edition, 2009. ISBN 978-1-4419-0457-7.
Avron et al. (2010) Haim Avron, Petar Maymounkov, and Sivan Toledo. Blendenpik: Supercharging LAPACK’s least-squares solver. SIAM Journal on Scientific Computing, 32(3):1217–1236, 2010.
Avron et al. (2014) Haim Avron, Huy L. Nguyen, and David P. Woodruff. Subspace Embeddings for the Polynomial Kernel. In Proceedings of the 27th International Conference on Neural Information Processing Systems - Volume 2, pages 2258–2266, Cambridge, MA, USA, 2014. MIT Press.
Battaglino et al. (2018) Casey Battaglino, Grey Ballard, and Tamara G. Kolda. A practical randomized CP tensor decomposition. SIAM Journal on Matrix Analysis and Applications, 39(2):876–901, 2018.
Beylkin and Mohlenkamp (2002) Gregory Beylkin and Martin J. Mohlenkamp. Numerical operator calculus in higher dimensions. Proceedings of the National Academy of Sciences, 99(16):10246–10251, 2002.
Beylkin and Mohlenkamp (2006) Gregory Beylkin and Martin J. Mohlenkamp. Algorithms for Numerical Analysis in High Dimensions. SIAM Journal on Scientific Computing, 26(6):2133–2159, July 2006.
Biagioni et al. (2015) David J. Biagioni, Daniel Beylkin, and Gregory Beylkin. Randomized interpolative decomposition of separated representations. Journal of Computational Physics, 281(C):116–134, January 2015. ISSN 0021-9991. doi: 10.1016/j.jcp.2014.10.009.