Randomized algorithms for low-rank tensor decompositions in the Tucker   format

Rachel Minster; Arvind K. Saibaba; Misha E. Kilmer

arXiv:1905.07311·math.NA·May 20, 2019

Randomized algorithms for low-rank tensor decompositions in the Tucker format

Rachel Minster, Arvind K. Saibaba, Misha E. Kilmer

PDF

TL;DR

This paper develops and analyzes randomized algorithms for low-rank tensor decompositions in the Tucker format, enabling efficient compression of large-scale datasets with applications in image and text data.

Contribution

It introduces randomized variants of HOSVD and STHOSVD algorithms with probabilistic error analysis and adaptive, structure-preserving features for large-scale tensor data.

Findings

01

Randomized algorithms achieve accurate low-rank tensor approximations.

02

Adaptive method finds low-rank representations without prior rank knowledge.

03

Structure-preserving variant handles large sparse tensors effectively.

Abstract

Many applications in data science and scientific computing involve large-scale datasets that are expensive to store and compute with, but can be efficiently compressed and stored in an appropriate tensor format. In recent years, randomized matrix methods have been used to efficiently and accurately compute low-rank matrix decompositions. Motivated by this success, we focus on developing randomized algorithms for tensor decompositions in the Tucker representation. Specifically, we present randomized versions of two well-known compression algorithms, namely, HOSVD and STHOSVD. We present a detailed probabilistic analysis of the error of the randomized tensor algorithms. We also develop variants of these algorithms that tackle specific challenges posed by large-scale datasets. The first variant adaptively finds a low-rank representation satisfying a given tolerance and it is beneficial…

Tables6

Table 1. Table 1: Computational Cost for the HOSVD, R-HOSVD, STHOSVD, and R-STHOSVD algorithms. The first term in each expression is the cost of computing an SVD of the mode unfoldings, and the second is the cost of forming the core tensor.

Algorithm	Cost for $𝓧 \in ℝ^{I_{1} \times \dots \times I_{d}}$	Cost for $𝓧 \in ℝ^{I \times \dots \times I}$
HOSVD	$𝒪 (\sum_{j = 1}^{d} I_{j} I_{1 : d} + \sum_{j = 1}^{d} r_{1 : j} I_{j : d})$	$𝒪 (d I^{d + 1} + \sum_{j = 1}^{d} r^{j} I^{d - j + 1})$
R-HOSVD	$𝒪 (\sum_{j = 1}^{d} r_{j} I_{1 : d} + \sum_{j = 1}^{d} r_{1 : j} I_{j : d})$	$𝒪 (d r I^{d} + \sum_{j = 1}^{d} r^{j} I^{d - j + 1})$
STHOSVD	$𝒪 (\sum_{j = 1}^{d} I_{j} r_{1 : j - 1} I_{j : d} + \sum_{j = 1}^{d} r_{1 : j} I_{j + 1 : d})$	$𝒪 (\sum_{j = 1}^{d} r^{j - 1} I^{d - j + 2} + r^{j} I^{d - j})$
R-STHOSVD	$𝒪 (\sum_{j = 1}^{d} r_{1 : j} I_{j : d} + \sum_{j = 1}^{d} r_{1 : j} I_{j + 1 : d})$	$𝒪 (\sum_{j = 1}^{d} r^{j} I^{d - j + 1} + r^{j} I^{d - j})$

Table 2. Table 2: Summary of sparse tensor examples from the FROSTT database—we include the details for both the full datasets and the condensed datasets used in our experiments.

Original Tensor	Order	Size	Nonzeros
NELL-2	3	$12092 \times 9184 \times 28818$	$76, 879, 419$
Enron	4	$6066 \times 5699 \times 244268 \times 1176$	$54, 202, 099$

Table 3. Table 3: Runtime in seconds of the HOSVD, R-HOSVD, STHOSVD, and R-STHOSVD algorithms on the Hilbert tensor 𝓧 𝓧 \boldsymbol{\mathcal{X}} with d = 5 𝑑 5 d=5 , averaged over three runs. Each algorithm is run with target rank ( 5 , 5 , 5 , 5 , 5 ) 5 5 5 5 5 (5,5,5,5,5) , and the randomized algorithms use oversampling parameter p = 5 𝑝 5 p=5 . The STHOSVD and R-STHOSVD algorithms use the processing order ρ = [ 1 , 2 , 3 , 4 , 5 ] 𝜌 1 2 3 4 5 \rho=[1,2,3,4,5] .

	Algorithm
		HOSVD	R-HOSVD	STHOSVD	R-STHOSVD
	25	$3.0030$	$1.2609$	$0.6455$	$0.2875$
$I_{j}$	35	$14.5255$	$5.5288$	$3.1974$	$1.2090$
$1 \leq j \leq d$	45	$70.0536$	$17.3744$	$15.0629$	$3.5587$
	50	$101.4981$	$27.7725$	$21.9745$	$5.5606$

Table 4. Table 4: Rank (size of the core tensor) obtained by the adaptive STHOSVD algorithm ( Algorithm 5 ) with different relative error tolerances ϵ italic-ϵ \epsilon as the size of each dimension of 𝓧 𝓧 \boldsymbol{\mathcal{X}} increase. The inputs are ϵ italic-ϵ \epsilon , block size b = 1 𝑏 1 b=1 , and processing order ρ = [ 1 , 2 , 3 ] 𝜌 1 2 3 \rho=[1,2,3] .

	tolerance $ϵ$
		$10^{- 3}$	$10^{- 4}$	$10^{- 5}$	$10^{- 6}$	$10^{- 7}$
	25	$(4, 4, 4)$	$(5, 5, 5)$	$(6, 6, 6)$	$(7, 7, 7)$	$(8, 8, 8)$
$I$	50	$(5, 5, 5)$	$(6, 6, 6)$	$(7, 7, 7)$	$(8, 8, 8)$	$(9, 9, 9)$
$1 \leq j \leq d$	100	$(5, 5, 5)$	$(6, 6, 6)$	$(8, 8, 8)$	$(9, 9, 9)$	$(10, 10, 10)$

Table 5. Table 5: A comparison of the adaptive R-STHOSVD algorithm Algorithm 5 to the STHOSVD. We first obtained the rank of the core tensor with the requested relative error tolerance from the adaptive algorithm. Then we compared the actual error of the approximation from the adaptive R-STHOSVD to that of an STHOSVD with the same rank. The processing order for all runs was ρ = [ 2 , 1 , 3 ] 𝜌 2 1 3 \rho=[2,1,3] . *Each STHOSVD was computed with the corresponding rank found in the second column.

Error tolerance $ϵ$	Corresponding rank $𝐫$	Actual error	Rank- $𝐫$ STHOSVD error*
$0.25$	$(3, 10, 1)$	$0.1995$	$0.1995$
$0.2$	$(10, 23, 1)$	$0.1799$	$0.1796$
$0.15$	$(22, 51, 5)$	$0.1421$	$0.1403$
$0.1$	$(32, 114, 8)$	$0.0965$	$0.0946$
$0.05$	$(38, 237, 10)$	$0.0400$	$0.0381$
$0.01$	$(40, 381, 10)$	$0.0057$	$0.0055$

Table 6. Table 6: The relative error and runtime of both SP-STHOSVD and R-STHOSVD on both the condensed and subsampled Enron dataset and the condensed NELL-2 dataset as the target rank ( r , r , r ) 𝑟 𝑟 𝑟 (r,r,r) increases. The processing order was ρ = [ 3 , 1 , 2 ] 𝜌 3 1 2 \rho=[3,1,2] , and the oversampling parameter was p = 5 𝑝 5 p=5 . Note that the rank is the same for each mode for simplicity, and that the input rank for the R-STHOSVD was ( r + p , r + p , r + p ) 𝑟 𝑝 𝑟 𝑝 𝑟 𝑝 (r+p,r+p,r+p) so the approximations have the same size.

	NELL-2
	Relative Error		Runtime in seconds
Target Rank	SP-STHOSVD	R-STHOSVD	SP-STHOSVD	R-STHOSVD
$30$	$0.2968$	$0.1319$	$0.5690$	$17.0642$
$60$	$0.2282$	$0.0914$	$0.9606$	$18.9203$
$90$	$0.1950$	$0.0699$	$1.5889$	$23.3303$
$120$	$0.1666$	$0.0573$	$2.0706$	$28.9399$
$150$	$0.1431$	$0.0478$	$2.1310$	$33.6867$
$180$	$0.1201$	$0.0417$	$2.3678$	$39.0644$
$210$	$0.1181$	$0.0367$	$3.0832$	$45.5227$
$240$	$0.1095$	$0.0326$	$3.7282$	$52.4856$

Equations117

x_{i_{1}, \dots, i_{d}} 1 \leq i_{j} \leq I_{j} j = 1, \dots, d .

x_{i_{1}, \dots, i_{d}} 1 \leq i_{j} \leq I_{j} j = 1, \dots, d .

Y_{i_{1}, \dots, i_{j - 1}, k, i_{j + 1}, \dots, i_{d}} = i_{j} = 1 \sum I_{j} x_{i_{1}, \dots, i_{d}} a_{k i_{j}} 1 \leq k \leq K j = 1, \dots, d .

Y_{i_{1}, \dots, i_{j - 1}, k, i_{j + 1}, \dots, i_{d}} = i_{j} = 1 \sum I_{j} x_{i_{1}, \dots, i_{d}} a_{k i_{j}} 1 \leq k \leq K j = 1, \dots, d .

∥ X - X \bigtimes_{j = 1}^{d} Π_{j} ∥_{F}^{2} = j = 1 \sum d ∥ X \bigtimes_{i = 1}^{j - 1} Π_{i} \times_{j} (I - Π_{j}) ∥_{F}^{2} \leq j = 1 \sum d ∥ X - X \times_{j} Π_{j} ∥_{F}^{2} .

∥ X - X \bigtimes_{j = 1}^{d} Π_{j} ∥_{F}^{2} = j = 1 \sum d ∥ X \bigtimes_{i = 1}^{j - 1} Π_{i} \times_{j} (I - Π_{j}) ∥_{F}^{2} \leq j = 1 \sum d ∥ X - X \times_{j} Π_{j} ∥_{F}^{2} .

A \otimes B = a_{11} B a_{21} B ⋮ a_{m 1} B a_{12} B a_{22} B ⋮ a_{m 2} B \dots \dots ⋱ \dots a_{1 n} B a_{2 n} B ⋮ a_{mn} B \in R^{mk \times n ℓ} .

A \otimes B = a_{11} B a_{21} B ⋮ a_{m 1} B a_{12} B a_{22} B ⋮ a_{m 2} B \dots \dots ⋱ \dots a_{1 n} B a_{2 n} B ⋮ a_{mn} B \in R^{mk \times n ℓ} .

(A \otimes B) (C \otimes D)

(A \otimes B) (C \otimes D)

(A \otimes B)^{⊤}

Y_{(j)} = A_{j} X_{(j)} (A_{d}^{⊤} \otimes A_{d - 1}^{⊤} \otimes \dots \otimes A_{j + 1}^{⊤} \otimes A_{j - 1}^{⊤} \otimes \dots \otimes A_{1}^{⊤}) .

Y_{(j)} = A_{j} X_{(j)} (A_{d}^{⊤} \otimes A_{d - 1}^{⊤} \otimes \dots \otimes A_{j + 1}^{⊤} \otimes A_{j - 1}^{⊤} \otimes \dots \otimes A_{1}^{⊤}) .

∥ X - X ∥_{F}^{2} \leq j = 1 \sum d ∥ X \times_{j} (I - A_{j} A_{j}^{⊤}) ∥_{F}^{2} = j = 1 \sum d i = r_{j} + 1 \sum I_{j} σ_{i}^{2} (X_{(j)}) .

∥ X - X ∥_{F}^{2} \leq j = 1 \sum d ∥ X \times_{j} (I - A_{j} A_{j}^{⊤}) ∥_{F}^{2} = j = 1 \sum d i = r_{j} + 1 \sum I_{j} σ_{i}^{2} (X_{(j)}) .

Δ_{j}^{2} (X) \equiv i = r_{j} + 1 \sum I_{j} σ_{i}^{2} (X_{(j)}) j = 1, \dots, d . \vspace - .3 c m

Δ_{j}^{2} (X) \equiv i = r_{j} + 1 \sum I_{j} σ_{i}^{2} (X_{(j)}) j = 1, \dots, d . \vspace - .3 c m

∥ X - X ∥_{F}^{2} =

∥ X - X ∥_{F}^{2} =

Y \in S min ∥ X - Y ∥_{F} = ∥ X - X_{opt} ∥_{F} .

Y \in S min ∥ X - Y ∥_{F} = ∥ X - X_{opt} ∥_{F} .

∥ X - X ∥_{F} \leq d ∥ X - X_{opt} ∥_{F} .

∥ X - X ∥_{F} \leq d ∥ X - X_{opt} ∥_{F} .

Δ_{j} (X) \leq ∥ X - X_{opt} ∥_{F} j = 1, \dots, d .

Δ_{j} (X) \leq ∥ X - X_{opt} ∥_{F} j = 1, \dots, d .

Cost = 2 (r + p) O (nnz (X)) + O (r^{2} (m + n)),

Cost = 2 (r + p) O (nnz (X)) + O (r^{2} (m + n)),

E_{Ω} ∥ X - Q Q^{⊤} X ∥_{F}^{2} \leq E_{Ω} ∥ X - Q B_{r} ∥_{F}^{2} \leq (1 + \frac{r}{p - 1}) j = r + 1 \sum m i n {m, n} σ_{j}^{2} (X) .

E_{Ω} ∥ X - Q Q^{⊤} X ∥_{F}^{2} \leq E_{Ω} ∥ X - Q B_{r} ∥_{F}^{2} \leq (1 + \frac{r}{p - 1}) j = r + 1 \sum m i n {m, n} σ_{j}^{2} (X) .

E_{Ω} ∥ X - QQ^{⊤} X ∥_{F} \leq 1 + \frac{r}{p - 1} j = r + 1 \sum m i n {m, n} σ_{j}^{2} (X)^{1/2},

E_{Ω} ∥ X - QQ^{⊤} X ∥_{F} \leq 1 + \frac{r}{p - 1} j = r + 1 \sum m i n {m, n} σ_{j}^{2} (X)^{1/2},

E_{{Ω_{k}}_{k = 1}^{d}} ∥ X - X ∥_{F} \leq

E_{{Ω_{k}}_{k = 1}^{d}} ∥ X - X ∥_{F} \leq

\leq

E_{{Ω_{k}}_{k = 1}^{d}} ∥ X - X ∥_{F}^{2} \leq E_{{Ω_{k}}_{k = 1}^{d}} j = 1 \sum d ∥ X \times_{j} (I - A_{j} A_{j}^{⊤}) ∥_{F}^{2} = j = 1 \sum d E_{Ω_{j}} ∥ X \times_{j} (I - A_{j} A_{j}^{⊤}) ∥_{F}^{2},

E_{{Ω_{k}}_{k = 1}^{d}} ∥ X - X ∥_{F}^{2} \leq E_{{Ω_{k}}_{k = 1}^{d}} j = 1 \sum d ∥ X \times_{j} (I - A_{j} A_{j}^{⊤}) ∥_{F}^{2} = j = 1 \sum d E_{Ω_{j}} ∥ X \times_{j} (I - A_{j} A_{j}^{⊤}) ∥_{F}^{2},

E_{{Ω_{k}}_{k = 1}^{d}} ∥ X - X ∥_{F}^{2} \leq j = 1 \sum d (1 + \frac{r _{j}}{p - 1}) Δ_{j}^{2} (X) .

E_{{Ω_{k}}_{k = 1}^{d}} ∥ X - X ∥_{F}^{2} \leq j = 1 \sum d (1 + \frac{r _{j}}{p - 1}) Δ_{j}^{2} (X) .

E_{{Ω_{k}}_{k = 1}^{d}} ∥ X - X ∥_{F} \leq (E_{{Ω_{k}}_{k = 1}^{d}} ∥ X - X ∥_{F}^{2})^{1/2} \leq (j = 1 \sum d (1 + \frac{r _{j}}{p - 1}) Δ_{j}^{2} (X))^{1/2} .

E_{{Ω_{k}}_{k = 1}^{d}} ∥ X - X ∥_{F} \leq (E_{{Ω_{k}}_{k = 1}^{d}} ∥ X - X ∥_{F}^{2})^{1/2} \leq (j = 1 \sum d (1 + \frac{r _{j}}{p - 1}) Δ_{j}^{2} (X))^{1/2} .

E_{{Ω_{k}}_{k = 1}^{d}} ∥ X - X ∥_{F} \leq (d + \frac{\sum _{j = 1}^{d} r _{j}}{p - 1})^{1/2} ∥ X - X_{opt} ∥_{F} .

E_{{Ω_{k}}_{k = 1}^{d}} ∥ X - X ∥_{F} \leq (d + \frac{\sum _{j = 1}^{d} r _{j}}{p - 1})^{1/2} ∥ X - X_{opt} ∥_{F} .

E_{{Ω_{k}}_{k = 1}^{d}} ∥ X - X ∥_{F} \leq 2 ∥ X - X_{HOSVD} ∥_{F} or E_{{Ω_{k}}_{k = 1}^{d}} ∥ X - X ∥_{F} \leq 2 d ∥ X - X_{opt} ∥_{F} .

E_{{Ω_{k}}_{k = 1}^{d}} ∥ X - X ∥_{F} \leq 2 ∥ X - X_{HOSVD} ∥_{F} or E_{{Ω_{k}}_{k = 1}^{d}} ∥ X - X ∥_{F} \leq 2 d ∥ X - X_{opt} ∥_{F} .

E_{{Ω_{k}}_{k = 1}^{d}} ∥ X - X ∥_{F} \leq 1 + ϵ ∥ X - X_{HOSVD} ∥_{F} or E_{{Ω_{k}}_{k = 1}^{d}} ∥ X - X ∥_{F} \leq d (1 + ϵ) ∥ X - X_{opt} ∥_{F} .

E_{{Ω_{k}}_{k = 1}^{d}} ∥ X - X ∥_{F} \leq 1 + ϵ ∥ X - X_{HOSVD} ∥_{F} or E_{{Ω_{k}}_{k = 1}^{d}} ∥ X - X ∥_{F} \leq d (1 + ϵ) ∥ X - X_{opt} ∥_{F} .

E_{{Ω_{k}}_{k = 1}^{d}} ∥ X - X ∥_{F} \leq

E_{{Ω_{k}}_{k = 1}^{d}} ∥ X - X ∥_{F} \leq

\leq

E_{{Ω_{k}}_{k = 1}^{d}} ∥ X - X ∥_{F}^{2} = j = 1 \sum d E_{{Ω_{k}}_{k = 1}^{d}} ∥ X^{(j - 1)} - X^{(j)} ∥_{F}^{2} = j = 1 \sum d E_{{Ω_{k}}_{k = 1}^{j}} ∥ X^{(j - 1)} - X^{(j)} ∥_{F}^{2} .

E_{{Ω_{k}}_{k = 1}^{d}} ∥ X - X ∥_{F}^{2} = j = 1 \sum d E_{{Ω_{k}}_{k = 1}^{d}} ∥ X^{(j - 1)} - X^{(j)} ∥_{F}^{2} = j = 1 \sum d E_{{Ω_{k}}_{k = 1}^{j}} ∥ X^{(j - 1)} - X^{(j)} ∥_{F}^{2} .

E_{{Ω_{k}}_{k = 1}^{j}} ∥ X^{(j - 1)} - X^{(j)} ∥_{F}^{2} = E_{{Ω_{k}}_{k = 1}^{j - 1}} {E_{Ω_{j}} ∥ X^{(j - 1)} - X^{(j)} ∥_{F}^{2}} .

E_{{Ω_{k}}_{k = 1}^{j}} ∥ X^{(j - 1)} - X^{(j)} ∥_{F}^{2} = E_{{Ω_{k}}_{k = 1}^{j - 1}} {E_{Ω_{j}} ∥ X^{(j - 1)} - X^{(j)} ∥_{F}^{2}} .

∥ X^{(j - 1)} - X^{(j)} ∥_{F}^{2} = ∥ G^{(j - 1)} \bigtimes_{i = 1}^{j - 1} A_{i} \times_{j} (I - A_{j} A_{j}^{⊤}) ∥_{F}^{2} .

∥ X^{(j - 1)} - X^{(j)} ∥_{F}^{2} = ∥ G^{(j - 1)} \bigtimes_{i = 1}^{j - 1} A_{i} \times_{j} (I - A_{j} A_{j}^{⊤}) ∥_{F}^{2} .

Z_{j} \equiv d - j terms I \otimes \dots \otimes I \otimes A_{j - 1} \otimes \dots \otimes A_{1} .

Z_{j} \equiv d - j terms I \otimes \dots \otimes I \otimes A_{j - 1} \otimes \dots \otimes A_{1} .

∥ X^{(j - 1)} - X^{(j)} ∥_{F}^{2}

∥ X^{(j - 1)} - X^{(j)} ∥_{F}^{2}

\leq ∥ (I - A_{j} A_{j}^{⊤}) G_{(j)}^{(j - 1)} ∥_{F}^{2} .

Peer Reviews

No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.

Videos

No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.

Full text

Randomized Algorithms for Low-rank Tensor Decompositions in the Tucker Format††thanks: \fundingThe authors would like to acknowledge the support of the National Science Foundation through the grants DMS-1821148 (M.E.K) and DMS-1821149 (R.M. and A.K.S).

Rachel Minster Department of Mathematics, North Carolina State University, [email protected]

Arvind K. Saibaba Department of Mathematics, North Carolina State University, [email protected]

Misha E. Kilmer Department of Mathematics, Tufts University, [email protected]

Abstract

Many applications in data science and scientific computing involve large-scale datasets that are expensive to store and compute with, but can be efficiently compressed and stored in an appropriate tensor format. In recent years, randomized matrix methods have been used to efficiently and accurately compute low-rank matrix decompositions. Motivated by this success, we focus on developing randomized algorithms for tensor decompositions in the Tucker representation. Specifically, we present randomized versions of two well-known compression algorithms, namely, HOSVD and STHOSVD. We present a detailed probabilistic analysis of the error of the randomized tensor algorithms. We also develop variants of these algorithms that tackle specific challenges posed by large-scale datasets. The first variant adaptively finds a low-rank representation satisfying a given tolerance and it is beneficial when the target-rank is not known in advance. The second variant preserves the structure of the original tensor, and is beneficial for large sparse tensors that are difficult to load in memory. We consider several different datasets for our numerical experiments: synthetic test tensors and realistic applications such as the compression of facial image samples in the Olivetti database and word counts in the Enron email dataset.

1 Introduction

Tensors, or multi-way arrays, appear in a wide range of applications such as signal processing; neuroscientific applications such as Electroencephalography; data mining; seismic data processing; machine learning applications such as facial recognition, handwriting digit classification, and latent semantic indexing; imaging; astronomy; and uncertainty quantification. For example, a database of gray scale images constitutes a third order array when each image is stored as a matrix, while a numerical simulation of system of a partial differential equations (PDEs) in three-dimensional space when tracking several parameters over time yields a five-dimensional dataset. Often, these datasets are treated as matrices rather than as tensors, suggesting that additional structure that could be leveraged for gaining insight and lowering computational cost is often underutilized and undiscovered.

A key step in processing and studying these datasets involves a compression step either to find an economical representation in memory, or to find principal directions of variability. While working with tensors there are many possible formats one may consider, and each format is equipped with a different notion of compression and rank. Examples of tensor formats include CANDECOMP/PARAFAC (CP), Tucker, Hierarchical Tucker, and Tensor Train, all of which have their respective benefits (see surveys [28, 20, 10, 11]). The CP format which represents a tensor as a sum of rank $-1$ outer products gives a compact and unique (under certain conditions) representation. Tucker generally finds a better fit for data by estimating the subspace of each mode, while Hierarchical-Tucker and Tensor Train are useful for very high-dimensional tensors. In this paper, we focus on the Tucker representation which is known for its favorable compression properties in a modest number of dimensions (3-7 modes). Given a multilinear rank $\mathbf{r}$ , the Tucker form finds a rank $\mathbf{r}$ representation of a tensor as a product of a core tensor and factor matrices typically having orthonormal columns. Popular algorithms for compression in the Tucker format can be found in [12, 41, 13], and a survey of approximation techniques can be found in [20]. Also, depending on how small the target rank for an approximation is compared to the original dimensions, high compression can be achieved. If the data is such that this is not possible, representing it in the Tucker format can still give insight into its principal directions, since Tucker is also a form of higher-dimensional principal component analysis (PCA).

In recent years, randomized matrix algorithms have gained popularity for developing low-rank matrix approximations (see the review [24, 30, 15]). These algorithms are easy to implement, computationally efficient for a wide range of matrices (e.g. sparse matrices, matrices that can be accessed only via matrix-vector products, and dense matrices that are difficult to load in memory), and have accuracy comparable with non-randomized algorithms. There is also well-developed error analysis applicable to several classes of random matrices for randomized algorithms. Even more recently, randomized algorithms have been developed for tensor decompositions (see below for a detailed review). In this paper, we make several contributions by proposing new algorithms that are accompanied by rigorous error analysis. The results in this paper enable the efficient computation of low-rank Tucker decompositions across a wide range of applications. The randomized algorithms developed here exploit the structure of tensor decompositions, while the analysis provides insight into the choice of parameters that control the tradeoff between accuracy and computational efficiency.

Contributions and Contents

In Section 2, we first review the necessary background information on tensors and randomized algorithms. Then, in Section 3, we present analyses of randomized versions of HOSVD and STHOSVD (proposed in [47] and [8] respectively). Our contributions on this front include the probabilistic analysis of the randomized versions of these algorithms in expectation, as well as analysis of the associated computational costs. In Section 4 we present adaptive randomized algorithms to compute low-rank tensor decompositions to be used in applications where the target rank is not known beforehand. In Section 5, we present a new randomized compression algorithm for large tensors, which produces a low-rank decomposition whose core tensor has entries taken directly from the tensor of interest. In this sense, the core tensor preserves the structure (e.g., sparsity, non-negativity) of the original tensor. For sparse tensors, our algorithm has the added benefit that the intermediate and final decompositions can be stored efficiently, thus enabling the computation of low-rank tensor decompositions of large, sparse tensors. To supplement the algorithm, we provide a probabilistic analysis in expectation. Finally, in Section 6, we test the performance of all algorithms on several synthetic tensors and real-world datasets, and discuss the performance of the proposed bounds.

Related Work

Several randomized algorithms have been proposed for computing low-rank tensor decompositions, e.g., Tucker format [8, 46, 29, 32, 40, 19, 47], CP format [19, 5, 6, 43], t-product [46], tensor networks [4], and Tensor Train format [8, 26]. Our work is most similar to [8, 19, 47]. The algorithm for randomized HOSVD is presented in [47], and the corresponding analysis is presented in [19] (both unpublished manuscripts). Randomized and adaptive versions of the STHOSVD were proposed and analyzed in [8], but our manuscript uses a different distribution of random matrices (see Section 3 for a justification of our choice), and provides bounds in expectation. To our knowledge, our proposed algorithm for producing structure-preserving tensor decompositions and the corresponding error analysis are novel. Related to this algorithm is the CUR-type decomposition for tensors proposed in [34, 14]. In contrast, our algorithm produces decompositions in which the core tensor (rather than the factor matrices, in the aforementioned references) retains entries from the original tensor.

2 Background

In this section, we introduce the necessary background information for working with tensors, and review the standard compression algorithms. We also discuss the optimal approximation of a tensor for comparison purposes. Finally, we review the relevant background for randomized matrix algorithms, specifically the randomized SVD.

2.1 Notation and preliminaries

We denote a $d$ -mode tensor $\boldsymbol{\mathcal{X}}\in\mathbb{R}^{I_{1}\times\dots\times I_{d}}$ with entries

[TABLE]

A tensor can be “unfolded” into a matrix by reordering the elements, and this process is known as matricization. There are $d$ different unfoldings for a $d$ -mode tensor. Each mode- $j$ unfolding arranges the resulting matrix such that the columns are the mode- $j$ fibers of the tensor. The mode- $j$ unfolding is denoted as $\mathbf{X}_{(j)}\in\mathbb{R}^{I_{j}\times(\prod_{k\neq j}I_{k})}$ for $j=1,\dots,d$ .

Tensor product

The tensor product (or mode product) is a fundamental operation for multiplying a tensor by a matrix. Given a matrix $\mathbf{A}\in\mathbb{R}^{K\times I_{j}}$ , the mode- $j$ product of a tensor $\boldsymbol{\mathcal{X}}$ with $\mathbf{A}$ is denoted $\boldsymbol{\mathcal{Y}}=\boldsymbol{\mathcal{X}}\times_{j}\mathbf{A}$ , and has dimension $\boldsymbol{\mathcal{Y}}\in\mathbb{R}^{I_{1}\times\dots I_{j-1}\times K\times I_{j+1}\times\dots\times I_{d}}$ . More specifically, the product can be expressed in terms of the entries of the tensor as

[TABLE]

The tensor product can also be expressed as the product of two matrices. That is, we can write $\boldsymbol{\mathcal{Y}}_{(j)}=\mathbf{A}\mathbf{X}_{(j)}$ for $j=1,\dots,d$ . The following lemma will be useful in our analysis.

Lemma 2.1.

Let $\boldsymbol{\mathcal{X}}\in\mathbb{R}^{I_{1}\times\dots\times I_{d}}$ and let $\mathbf{\Pi}_{j}\in\mathbb{R}^{I_{j}\times I_{j}}$ be a sequence of $d$ orthogonal projectors. Then for $j=1,2,\dots,d$ ,

[TABLE]

The proof of this lemma can be found in [41, Theorem 5.1].

Tucker representation

The Tucker format of a tensor $\boldsymbol{\mathcal{X}}$ of rank $(r_{1},\dots,r_{d})$ consists of a core tensor $\boldsymbol{\mathcal{G}}\in\mathbb{R}^{r_{1}\times\dots\times r_{d}}$ and factor matrices $\{\mathbf{A}_{j}\}_{j=1}^{d}$ with each $\mathbf{A}_{j}\in\mathbb{R}^{I_{j}\times r_{j}}$ . For short, it is written as $[\boldsymbol{\mathcal{G}};\mathbf{A}_{1},\dots,\mathbf{A}_{d}]$ and represents $\boldsymbol{\mathcal{X}}=\boldsymbol{\mathcal{G}}\bigtimes_{j=1}^{d}\mathbf{A}_{j}$ .

Note that storing a tensor in Tucker form is beneficial as it requires less storage than a full tensor when the target rank is significantly less than the original dimension. For a $d$ -mode tensor $\boldsymbol{\mathcal{X}}\in\mathbb{R}^{I\times I\times\dots\times I}$ and target rank $(r,r,\dots,r)$ with $r\ll I$ , the cost of storing the Tucker form of $\boldsymbol{\mathcal{X}}$ is $\mathcal{O}(r^{d}+drI)$ , compared to $\mathcal{O}(I^{d})$ for a full tensor.

Kronecker products

The Kronecker product of two matrices $\mathbf{A}\in\mathbb{R}^{m\times n}$ and $\mathbf{B}\in\mathbb{R}^{k\times\ell}$ is

[TABLE]

We also note some properties of Kronecker products that will be useful in our analysis, namely

[TABLE]

Kronecker products are also useful for expressing tensor mode products in terms of matrix-matrix multiplications. Suppose $\boldsymbol{\mathcal{Y}}=\boldsymbol{\mathcal{X}}\bigtimes_{j=1}^{d}\mathbf{A}_{j}$ , then

[TABLE]

2.2 HOSVD/STHOSVD

The Higher Order SVD (HOSVD) and Sequentially Truncated Higher Order SVD (STHOSVD) are two popular algorithms for computing low-rank tensor decompositions in the Tucker format. Given a $d$ -mode tensor $\boldsymbol{\mathcal{X}}\in\mathbb{R}^{I_{1}\times\dots\times I_{d}}$ and target rank $\mathbf{r}=(r_{1},\dots,r_{d})$ , both algorithms give a compressed representation for the tensor in the Tucker format $\boldsymbol{\mathcal{X}}\approx[\boldsymbol{\mathcal{G}};\mathbf{A}_{1},\dots,\mathbf{A}_{d}],$ where $\boldsymbol{\mathcal{G}}\in\mathbb{R}^{r_{1}\times\dots\times r_{d}}$ is the core tensor of reduced rank, and $\{\mathbf{A}_{j}\}_{j=1}^{d}$ are factor matrices such that $\boldsymbol{\mathcal{X}}=\boldsymbol{\mathcal{G}}\bigtimes_{j=1}^{d}\mathbf{A}_{j}$ . The factor matrices $\mathbf{A}_{j}$ all have orthonormal columns, and each $\mathbf{A}_{j}\in\mathbb{R}^{I_{j}\times r_{j}}$ .

HOSVD

In the HOSVD algorithm, each mode is handled separately. The factor matrix $\mathbf{A}_{j}$ is formed from the first $r_{j}$ left singular vectors of $\mathbf{X}_{(j)}$ . Once all factor matrices $\mathbf{A}_{j}$ are found, the core tensor is formed by $\boldsymbol{\mathcal{G}}=\boldsymbol{\mathcal{X}}\bigtimes_{j=1}^{d}\mathbf{A}_{j}^{\top}$ . The error in approximating $\boldsymbol{\mathcal{X}}$ using the HOSVD depends on the error in each mode, as shown in the following theorem, the proof of which can be found in [41, Theorem 5.1].

Theorem 2.2.

Let $\widehat{\boldsymbol{\mathcal{X}}}=[\boldsymbol{\mathcal{G}};\mathbf{A}_{1},\dots,\mathbf{A}_{d}]$ be the rank- $\mathbf{r}$ approximation to $d$ -mode tensor $\boldsymbol{\mathcal{X}}\in\mathbb{R}^{I_{1}\times\dots\times I_{d}}$ using the HOSVD algorithm. Then

[TABLE]

This theorem says that the error in the rank $\mathbf{r}$ approximation of the tensor $\boldsymbol{\mathcal{X}}$ computed using the HOSVD is the sum of squares of the discarded singular values from each mode unfolding. To simplify the upper bound, we introduce the notation

[TABLE]

With this notation, the error in the HOSVD satisfies $\|\boldsymbol{\mathcal{X}}-\widehat{\boldsymbol{\mathcal{X}}}\|_{F}\leq\left(\sum_{j=1}^{d}\Delta_{j}^{2}(\boldsymbol{\mathcal{X}})\right)^{1/2}$ .

STHOSVD

An alternative to the HOSVD is the sequentially truncated HOSVD (STHOSVD) algorithm which also produces a compressed representation in the Tucker format. In contrast to HOSVD which processes the modes independently, the STHOSVD processes the modes sequentially. This makes the order in which the modes are processed important, since we may obtain different approximations by using different processing orders. At each stage, the core tensor is unfolded (initialized as the tensor $\boldsymbol{\mathcal{X}}$ ) and the factor matrix $\mathbf{A}_{j}$ is obtained by taking the first $r_{j}$ singular vectors. A new core tensor is obtained by projecting the core tensor onto the subspace spanned by the columns of the factor matrix. A characteristic feature of the STHOSVD is that the core tensor shrinks at each iteration thus making the later modes cheaper to compute.

Given a tensor $\boldsymbol{\mathcal{X}}$ and the processing order $\rho=[1,2,\dots,d]$ , the rank- $\mathbf{r}$ STHOSVD approximation of $\boldsymbol{\mathcal{X}}$ is $\widehat{\boldsymbol{\mathcal{X}}}=[\boldsymbol{\mathcal{G}};\mathbf{A}_{1},\dots,\mathbf{A}_{d}]$ , where each factor matrix $\mathbf{A}_{j}\in\mathbb{R}^{I_{j}\times r_{j}}$ has orthonormal columns, and the core tensor $\boldsymbol{\mathcal{G}}\in\mathbb{R}^{r_{1}\times r_{2}\times\dots\times r_{d}}$ is defined as $\boldsymbol{\mathcal{G}}=\boldsymbol{\mathcal{X}}\bigtimes_{j=1}^{d}\mathbf{A}_{j}^{\top}.$ At the $j$ -th step of the process, we have a partially truncated core tensor $\boldsymbol{\mathcal{G}}^{(j)}\in\mathbb{R}^{r_{1},\dots,r_{j},I_{j+1},\dots,I_{d}}$ defined as $\boldsymbol{\mathcal{G}}^{(j)}=\boldsymbol{\mathcal{X}}\bigtimes_{i=1}^{j}\mathbf{A}_{i}^{\top}.$ Then the $j$ -th partial approximation, of rank $(r_{1},\dots,r_{j},I_{j+1},\dots,I_{d})$ , can be defined as $\widehat{\boldsymbol{\mathcal{X}}}^{(j)}=\boldsymbol{\mathcal{G}}^{(j)}\bigtimes_{i=1}^{j}\mathbf{A}_{i}.$ The algorithm is initialized with $\widehat{\boldsymbol{\mathcal{X}}}^{(0)}=\boldsymbol{\mathcal{X}}$ .

The approximation error in this case is the sum of errors in the successive approximations, and has the same upper bound as that of HOSVD. This is shown in the following theorem, which assumes that the processing order is $\rho=[1,2,\dots,d]$ . If a different processing order is taken, the bound will remain the same, so this assumption is taken for ease of notation. The proof of this theorem can be found in [41, Theorem 6.5].

Theorem 2.3.

Let $\widehat{\boldsymbol{\mathcal{X}}}=[\boldsymbol{\mathcal{G}};\mathbf{A}_{1},\dots,\mathbf{A}_{d}]$ be the rank- $\mathbf{r}$ STHOSVD approximation to $d$ -mode tensor $\boldsymbol{\mathcal{X}}$ with $\rho=[1,2,\dots,d]$ the processing order of modes. Then

[TABLE]

The computational cost of the STHOSVD is lower than the HOSVD, which was established in [41] but is also reviewed in Section 3.3. Although both the error in the HOSVD and the STHOSVD satisfy the same upper bound, it is not clear which algorithm has a lower error. There is strong numerical evidence to suggest that STHOSVD typically has a lower error, although counterexamples to this claim have been found [41]. For these reasons, STHOSVD is preferable to HOSVD since it has a lower cost and the same worst case error bound. A downside to STHOSVD is that the processing order $\rho$ has to be determined in advance; some heuristics for this choice are given in [41].

2.3 Best Approximation

We would like to find an optimal rank- $\mathbf{r}$ approximation of a given tensor $\boldsymbol{\mathcal{X}}$ , which we will call $\widehat{\boldsymbol{\mathcal{X}}}_{\text{opt}}$ . Let $\mathcal{S}=\{\boldsymbol{\mathcal{Y}}\in\mathbb{R}^{I_{1}\times I_{2}\times\dots\times I_{d}}:\text{rank}(\mathbf{Y}_{(j)})\leq r_{j},j=1,\dots,d\}$ . Then $\widehat{\boldsymbol{\mathcal{X}}}_{\text{opt}}$ is an optimal tensor which satisfies the condition

[TABLE]

The Eckart-Young theorem [18] states that the optimal rank- $r$ approximation to a matrix $\mathbf{A}$ can be constructed using the SVD truncated to rank- $r$ . Unfortunately, an analog of this result for Tucker forms does not exist in higher dimensions. The existence of $\widehat{\boldsymbol{\mathcal{X}}}_{\text{opt}}$ is guaranteed by [23, Theorem 10.8]. In general, computing $\widehat{\boldsymbol{\mathcal{X}}}_{\text{opt}}$ requires the solution of an optimization problem. In [13], the higher order orthogonal iteration (HOOI), was proposed to compute the “best” approximation by generating a sequence of iterates by cycling through the modes sequentially. Because the HOOI algorithm requires repeated computations with the tensor $\boldsymbol{\mathcal{X}}$ , its implementation for large-scale tensors is challenging because of the overwhelming computational cost. Although neither the HOSVD nor the STHOSVD produce an optimal rank- $\mathbf{r}$ approximation, they do satisfy the inequality

[TABLE]

The proofs are available for the HOSVD (Theorem 10.3) and STHOSVD (Theorem 10.5) in [23]. The proof requires the observation that

[TABLE]

We highlight this inequality since it will be important for our subsequent analysis. The inequality Eq. 3 suggests that the outputs of the HOSVD and the STHOSVD are accurate for low dimensions and can be employed in three different ways: either as approximations to $\widehat{\boldsymbol{\mathcal{X}}}_{\text{opt}}$ , as starting guesses to the HOOI algorithm, or to fit CP models.

2.4 Randomized SVD

It is well-known that computing the full SVD of a matrix costs $\mathcal{O}(mn^{2})$ , assuming $m\geq n$ . When the dimensions of a matrix $\mathbf{X}\in\mathbb{R}^{m\times n}$ are very large, the computational cost of a full SVD may be prohibitively expensive. Randomized SVD, popularized by [24], is a computationally efficient way to compute a rank- $r$ approximation of a matrix $\mathbf{X}$ . Assuming that $\mathbf{X}$ is approximately low-rank or has singular values that decay rapidly, the randomized SVD delivers a good low-rank representation of the matrix. Compared to the full SVD, the randomized SVD is much more computationally efficient across a wide-range of matrices.

We now describe the randomized SVD algorithm for computing a rank- $r$ approximation to $\mathbf{X}$ , where the target rank $1\leq r\leq\mathsf{rank}(\mathbf{X})$ . The randomized algorithm has two distinct stages—the range finding stage, and the postprocessing stage to compute the low-rank approximation. In the range finding stage, we first draw a random matrix $\mathbf{\Omega}\in\mathbb{R}^{n\times(r+p)}$ , where $r$ is the desired target rank, and $p$ is a small nonnegative integer used as an oversampling parameter. While many choices for the distribution of $\mathbf{\Omega}$ are possible, in this paper, we use the Gaussian distribution, i.e., the entries $\mathbf{\Omega}_{ij}$ are independent and identically distributed $\mathcal{N}(0,1)$ random variables. Then, we compute the matrix $\mathbf{Y}=\mathbf{X}\mathbf{\Omega}$ whose columns consist of random linear combinations of the columns of $\mathbf{X}$ . By taking a thin QR factorization $\mathbf{Y}=\mathbf{Q}\mathbf{R}$ , we get a matrix $\mathbf{Q}$ with orthonormal columns whose span approximates the range of $\mathbf{X}$ . If the matrix $\mathbf{X}$ is approximately rank $r$ , $\mathcal{R}(\mathbf{Q})$ is a good approximation for $\mathcal{R}(\mathbf{X})$ as the range of $\mathbf{X}$ is characterized by just the first $r$ left singular vectors. We can then express $\mathbf{X}\approx\mathbf{Q}\mathbf{Q}^{\top}\mathbf{X}$ . In the second stage, we convert the low rank representation into the SVD format. To this end, we compute the thin SVD of $\mathbf{B}=\mathbf{Q}^{\top}\mathbf{X}$ , giving $\mathbf{B}=\widehat{\mathbf{U}}_{\mathbf{B}}\widehat{\mathbf{\Sigma}}\widehat{\mathbf{V}}^{\top}$ . We truncate this representation to rank $r$ by only retaining the first $r$ diagonal elements of $\mathbf{\Sigma}$ and drop the corresponding columns from $\widehat{\mathbf{U}}_{\mathbf{B}}$ and $\widehat{\mathbf{V}}$ . Finally, we compute $\widehat{\mathbf{U}}=\mathbf{Q}\widehat{\mathbf{U}}_{\mathbf{B}}$ , and obtain the low-rank approximation $\widehat{\mathbf{X}}=\widehat{\mathbf{U}}\widehat{\mathbf{\Sigma}}\widehat{\mathbf{V}}^{\top}$ . Algorithm 1 summarizes the process.

We briefly review the computational cost of the RandSVD algorithm. Let $T$ denote the computational cost of a matrix-vector product (denoted matvec) involving the matrix $\mathbf{X}$ . Then, the cost of the algorithm is

[TABLE]

where $\mathsf{nnz}$ denotes the number of nonzeros of $\mathbf{X}$ .

An error bound for Algorithm 1 in the Frobenius norm is presented below, and we will use this result frequently in our analysis. This theorem and its proof can be found in [46, Theorem 3].

Theorem 2.4.

Let $\mathbf{X}\in\mathbb{R}^{m\times n}$ , and $\mathbf{\Omega}\in\mathbb{R}^{n\times(r+p)}$ be a Gaussian random matrix. Suppose $\mathbf{Q}$ is obtained from Algorithm 1 with inputs target rank $r$ and oversampling parameter $p\geq 2$ such that $r+p\leq\min\{m,n\}$ , and let $\mathbf{B}_{r}$ be the rank- $r$ truncated SVD of $\mathbf{Q}^{\top}\mathbf{X}$ . Then, the error in expectation satisfies

[TABLE]

Remark 2.5.

We will use two slightly different formulations of this theorem in our later results. Instead of $\|\mathbf{X}-\mathbf{Q}\mathbf{B}_{r}\|_{F}^{2}$ , we will use $\|\mathbf{X}-\widehat{\mathbf{U}}\widehat{\mathbf{U}}^{\top}\mathbf{X}\|_{F}^{2}$ . It is straightforward to show the equivalence between the two forms, see [35, section 5.3] for the explicit details. We will also use

[TABLE]

*which can be obtained by applying Hölder’s inequality [27, Theorem 23.10] to the stated result in the theorem. *

We could use other distributions for the random matrix $\mathbf{\Omega}$ , but probabilistic bounds in expectation like Theorem 2.4 do not exist for any distribution other than Gaussian. There are large deviation bounds for other distributions but we do not consider them here.

3 Randomized HOSVD/STHOSVD

In this section, we present randomized algorithms that are modified versions of the HOSVD and STHOSVD. We also develop rigorous error analysis and compare the two algorithms in terms of computational cost.

3.1 Algorithms

As mentioned earlier, the HOSVD algorithm first computes an SVD of each mode unfolding to construct the factor matrices, which are then used to form the core tensor. Computing this decomposition for large, high-dimensional tensors can be prohibitively expensive. To address this computation cost, we replace a full SVD of each mode unfolding with a randomized SVD of each mode unfolding to construct the factor matrix. The procedure to compute the core tensor remains unchanged. This is reflected in Algorithm 2 and we call this the R-HOSVD algorithm. To our knowledge, this was first proposed in [47].

The randomized version of STHOSVD is obtained in a similar way; at each step, the SVD of the unfolded core tensor is replaced with a randomized SVD. This is shown in Algorithm 3 (we call this R-STHOSVD) and is similar to the algorithm proposed in [8]. The major difference of Algorithm 3 compared to [8] is the distribution of random matrices $\{\mathbf{\Omega}_{j}\}_{j=1}^{d}$ . The authors in [8] advocate $\mathbf{\Omega}_{j}$ constructed as a Khatri-Rao product of Gaussian random matrices as opposed to standard Gaussian random matrices which we use. The main reason for avoiding standard Gaussian random matrices appear to be because of the high storage costs; however, we note that the matrices $\mathbf{\Omega}_{j}$ need not be stored explicitly. Its entries may be generated on-the-fly, either column-wise or in appropriately sized blocks. We next analyze the error in the decompositions produced using the R-HOSVD and the R-STHOSVD algorithms.

3.2 Error Analysis

In the results below, we assume that the matrices $\{\mathbf{\Omega}_{j}\}_{j=1}^{d}$ are standard Gaussian random matrices of appropriate sizes.

Theorem 3.1 (Randomized HOSVD).

Let $\widehat{\boldsymbol{\mathcal{X}}}=[\boldsymbol{\mathcal{G}};\mathbf{A}_{1},\dots,\mathbf{A}_{d}]$ be the output of Algorithm 2 with inputs target rank $\mathbf{r}=(r_{1},r_{2},\dots,r_{d})$ and oversampling parameter $p\geq 2$ . Furthermore, assume that $p$ satisfies $r_{j}+p\leq\min\{I_{j},\prod_{i\neq j}I_{i}\}$ for $j=1,\dots,d$ . Then, the expected error in the approximation is

[TABLE]

Proof 3.2.

From Lemma 2.1, we can write

[TABLE]

where the equality comes from linearity of expectations and the independence of $\mathbf{\Omega}_{j}$ for each mode $j$ . We can unfold each term in the summation as $\|\boldsymbol{\mathcal{X}}\times_{j}(\mathbf{I}-\mathbf{A}_{j}\mathbf{A}_{j}^{\top})\|_{F}^{2}=\|(\mathbf{I}-\mathbf{A}_{j}\mathbf{A}_{j}^{\top})\mathbf{X}_{(j)}\|_{F}^{2}$ . Then, by applying Theorem 2.4, we can bound the expected value of the squared error in each mode to obtain

[TABLE]

Finally, Hölder’s inequality gives

[TABLE]

For the second inequality, recall that $\Delta^{2}_{j}(\boldsymbol{\mathcal{X}})\leq\|\boldsymbol{\mathcal{X}}-\boldsymbol{\mathcal{\hat{X}}}_{\text{opt}}\|_{F}^{2}$ from Eq. 4. Thus, combined with the previous inequality, we have

[TABLE]

To compare this result to the approximation error obtained using the HOSVD algorithm, we consider a few special cases. Let $r=\max_{1\leq j\leq d}r_{j}$ . Then, if $p=r+1$ , this factor becomes

[TABLE]

Similarly, if we choose $p=\lceil\frac{r}{\epsilon}\rceil+1$ for some $\epsilon>0$ , the error satisfies

[TABLE]

This shows that the application of a randomized SVD in each mode of the tensor does not seriously deteriorate the accuracy compared to using an SVD.

Now consider the randomized STHOSVD approximation. When bounding the error in expectation in this case, it is important to note that at each intermediate step, the partially truncated core tensor is a random tensor. This is in contrast to the R-HOSVD, where we only needed to account for $\mathbf{\Omega}_{j}$ for each mode because the operations are independent across the modes. Computing the modes sequentially means, when processing mode $j$ , we must account for all the random matrices $\mathbf{\Omega}_{k}$ , where $k=1,\dots,j$ .

For this theorem, we use the same notation introduced in Section 2.2 for the STHOSVD, in that the partially truncated core tensor at step $j$ is $\boldsymbol{\mathcal{G}}^{(j)}=\boldsymbol{\mathcal{X}}\bigtimes_{i=1}^{j}\mathbf{A}_{i}^{\top}$ , giving a partial approximation $\widehat{\boldsymbol{\mathcal{X}}}^{(j)}=\boldsymbol{\mathcal{G}}^{(j)}\bigtimes_{i=1}^{j}\mathbf{A}_{i}$ .

Theorem 3.3 (Randomized STHOSVD).

Let $\widehat{\boldsymbol{\mathcal{X}}}=[\boldsymbol{\mathcal{G}};\mathbf{A}_{1},\dots,\mathbf{A}_{d}]$ be the output of Algorithm 3 with inputs target rank $\mathbf{r}=(r_{1},r_{2},\dots,r_{d})$ , processing order $\rho$ , and oversampling parameter $p\geq 2$ . Furthermore, assume that $p$ satisfies $r_{j}+p\leq\min\{I_{j},\prod_{i\neq j}I_{i}\}$ for $j=1,\dots,d$ . Then, the approximation error in expectation is

[TABLE]

Proof 3.4.

We first assume that the processing order is $\rho=[1,\dots,d]$ . The first equality in Lemma 2.1 and the linearity of expectations together give

[TABLE]

We have used the fact that the $j$ -th term in the summation does not depend on the random matrices $\{\mathbf{\Omega}_{k}\}_{k>j}$ . We first consider $\mathbb{E}_{\{\mathbf{\Omega}_{k}\}_{k=1}^{j}}\|\widehat{\boldsymbol{\mathcal{X}}}^{(j-1)}-\widehat{\boldsymbol{\mathcal{X}}}^{(j)}\|_{F}^{2}.$ Since all the $\mathbf{\Omega}_{k}$ ’s are independent, we can write the expectation in an iterated form as

[TABLE]

The $j$ -th term, which measures the difference in the sequential iterates, can be expressed as

[TABLE]

Now let

[TABLE]

If we unfold the difference $\widehat{\boldsymbol{\mathcal{X}}}^{(j-1)}-\widehat{\boldsymbol{\mathcal{X}}}^{(j)}$ along the $j$ -th mode, using Eq. 1 we have

[TABLE]

The inequality comes as $\mathbf{Z}_{j}$ has orthonormal columns for every $j$ .

Now, let $\mathbf{\Gamma}_{j}=\mathbf{G}_{(j)}^{(j-1)}$ for simplicity. We take expectations and bound this last quantity in Eq. 11 using Theorem 2.4 (keeping $\{\mathbf{\Omega}_{k}\}_{k=1}^{j-1}$ fixed), as

[TABLE]

We recall the definition and properties of Loewner partial ordering [25, Section 7.7]. Let $\mathbf{M},\mathbf{N}\in\mathbb{R}^{n\times n}$ be symmetric; $\mathbf{M}\preceq\mathbf{N}$ means $\mathbf{N}-\mathbf{M}$ is positive semidefinite. For $\mathbf{S}\in\mathbb{R}^{n\times m}$ , then $\mathbf{S}^{\top}\mathbf{MS}\preceq\mathbf{S}^{\top}\mathbf{NS}$ . Furthermore, $\lambda_{i}(\mathbf{M})\leq\lambda_{i}(\mathbf{N})$ for $i=1,\dots,n$ . Since $\mathbf{Z}_{j}$ has orthonormal columns, $\mathbf{Z}_{j}\mathbf{Z}_{j}^{\top}$ is a projector so that

[TABLE]

and the singular values of $\mathbf{\Gamma}_{j}$ , which are squared eigenvalues of $\mathbf{\Gamma}_{j}\mathbf{\Gamma}_{j}^{\top}$ , satisfy

[TABLE]

To summarize, Eqs. 10, 11, 12 and 13 combined give

[TABLE]

The equality follows since the tensor $\boldsymbol{\mathcal{X}}$ is deterministic. Finally, we have by Hölder’s inequality and Eq. 2 that

[TABLE]

*In the general case, when the processing order does not equal $\rho=[1,\dots,d]$ , the proof is similar. We only need to work with the processed order. We omit the details. *

We make several observations regarding Theorem 3.3. First, the upper bound for the error is the same for the R-STHOSVD as for the R-HOSVD (Theorem 3.1). Second, this result says that the error bound for R-STHOSVD is independent of the processing order. This means that while some processing orders may result in more accurate decompositions, every processing order has the same worst-case error bound. Our recommendation is to pick a processing order that minimizes the computational cost; see Section 3.3 for details. Third, the discussion following Theorem 3.1 regarding the choice of the oversampling parameter is applicable to the R-STHOSVD as well.

Extensions

There are several possible extensions of our results. First, we can extend this analysis to develop concentration results that give insight into the tail bounds. These can be obtained by combining our analysis with the results from, e.g., [21, Theorem 5.8]. Second, other distributions of random matrices may be used instead of standard Gaussian random matrices. Examples include Rademacher random matrices, sparse Rademacher random matrices, subsampled randomized Fourier transforms, etc. It is also possible to combine the analysis with the probabilistic bounds for the other decompositions. Typically, expectation bounds of the type presented in Theorem 3.3 and Theorem 3.1 are only possible for the standard Gaussian random matrices and one can only develop tail bounds. We do not pursue these extensions here but may consider them in future work.

3.3 Computational Cost

We discuss the computational costs of the proposed randomized algorithms and compare them against the HOSVD and the STHOSVD algorithms. We make the following assumptions. First, we assume that the tensors are dense and our implementations take no advantage of their structure. Second, we assume that the target ranks in each dimension are sufficiently small, i.e., $r_{j}\ll I_{j}$ so that we can neglect the computational cost of the QR factorization and the truncation steps of the RandSVD algorithm. Third, we assume that the random matrices used in the algorithms are standard Gaussian random matrices. If other distributions are used, the computational cost may be lower. Finally, for the STHOSVD and R-STHOSVD algorithms, we assume that the processing order is $\rho=[1,2,\dots,d].$

The computational cost of both HOSVD and STHOSVD was discussed in [41], and is reproduced in Table 1. In this paper, we also provide an analysis of the computational cost of R-HOSVD and R-STHOSVD, which is summarized in Table 1. The table includes the costs for both a general tensor $\boldsymbol{\mathcal{X}}\in\mathbb{R}^{I_{1}\times I_{2}\times\dots\times I_{d}}$ with target rank $\mathbf{r}=(r_{1},r_{2},\dots,r_{d})$ , as well as for the special case when $\boldsymbol{\mathcal{X}}\in\mathbb{R}^{I\times I\times\dots\times I}$ with target rank $\mathbf{r}=(r,r,\dots,r)$ . For ease of notation, denote the product $\prod_{k=i}^{j}I_{k}$ by $I_{i:j}$ , and similarly $\prod_{k=i}^{j}r_{k}=r_{i:j}$ for $1\leq i\leq j\leq d$ . The dominant costs of each algorithm lie in computing the SVD of the unfoldings (the first term in each summation) and forming the core tensor (the second term in each summation).

Processing order

In all the previous analyses for the R-STHOSVD algorithm, we took the processing order of modes to be $\rho=[1,2,\dots,d]$ . The error in the approximation depends on the choice of the processing order; however, Theorem 3.3 suggests that the worst case error is independent of the processing mode. For this reason, we choose a processing order that minimizes the computational cost. Since the dominant cost at each step $j$ is a randomized SVD with a cost of $\mathcal{O}(r_{\rho_{1}:\rho_{j}}I_{\rho_{j}:\rho_{d}})$ , we can minimize this by choosing to process the largest modes first. That is, we process the modes in order of decreasing mode sizes. Note that this is in contrast to the approach taken by [41] for the STHOSVD, in that they process in the order of increasing mode sizes to minimize the cost of the standard SVD in each step.

4 Adaptive Randomized Tensor Decompositions

In the algorithms described in the previous section, we had to assume prior knowledge of the target rank. This knowledge may not be available, or may be difficult to estimate in practice. Given a tensor $\boldsymbol{\mathcal{X}}$ , it is often desirable to produce a decomposition $\widehat{\boldsymbol{\mathcal{X}}}$ that satisfies

[TABLE]

where $0<\varepsilon<1$ is a user-defined parameter. Note that there may not be a unique tensor $\widehat{\boldsymbol{\mathcal{X}}}$ that satisfies this inequality, but it is desirable to find a tensor with a small multirank that does satisfy this inequality. We first explain the adaptive randomized algorithm to find a low-rank matrix approximation, and explain how this can be extended to the tensor case.

Several adaptive randomized range finders have been proposed in the literature [24, 45]. Given a matrix $\mathbf{X}$ and a tolerance $\varepsilon>0$ , the goal is to find a matrix $\mathbf{Q}$ with orthonormal columns that satisfies

[TABLE]

The number of columns of $\mathbf{Q}$ is taken to be the rank of the low rank approximation. The adaptive algorithms begin with a small number of columns of the random matrix $\mathbf{\Omega}$ to estimate the range $\mathbf{Q}$ and then sequentially increase the number of columns of $\mathbf{\Omega}$ until the matrix $\mathbf{Q}$ satisfies Eq. 14. In our paper, we use a version of the adaptive randomized range finding algorithm first proposed by [33] and refined in [45]. We denote the result of this algorithm as $\mathbf{Q}=\text{AdaptRangeFinder}(\mathbf{X},\varepsilon,b)$ , where $\mathbf{X}$ is the matrix to be approximated, $\varepsilon$ is the requested relative error tolerance, and $b$ is a blocking integer to determine how many columns of $\mathbf{\Omega}$ to draw at a time.

Adaptive R-HOSVD

We now explain how the adaptive range finder can be used for computing tensor factorizations. For the R-HOSVD, we apply this adaptive matrix algorithm to each mode unfolding $\mathbf{X}_{(j)}$ . This gives a factor matrices $\mathbf{A}_{j}$ for each mode $j=1,\dots,d$ . Given some tolerance $\varepsilon$ , the approximation error $\|\boldsymbol{\mathcal{X}}-\widehat{\boldsymbol{\mathcal{X}}}\|_{F}\leq\varepsilon\|\boldsymbol{\mathcal{X}}\|_{F}$ can be achieved if we choose the factor matrices $\mathbf{A}_{j}$ to satisfy

[TABLE]

Thus, we have apportioned an equal amount of error tolerance to each mode unfolding. Combined with Lemma 2.1, this ensures that an overall relative error $\varepsilon$ is achieved. This approach is summarized in Algorithm 4. Suppose we want a more flexible approach, in which a different tolerance $\epsilon_{j}$ is chosen for mode $j=1,\dots,d$ . This may be necessary, if we know in advance that we want to avoid compressing along some modes. In general, we may use any sequence $\epsilon_{j}$ , so long as it satisfies $(\sum_{j=1}^{d}\epsilon_{j}^{2})=\varepsilon^{2}$ . Indeed, setting $\epsilon_{j}=0$ for selected modes ensures that no compression is performed across those modes. Note that the choice $\epsilon_{j}=\varepsilon/\sqrt{d}$ for $j=1,\dots,d$ automatically satisfies this equality.

Adaptive R-STHOSVD

The same approach can be extended to the R-STHOSVD algorithm. We define the intermediate tensors $\boldsymbol{\mathcal{X}}^{(j)}=\boldsymbol{\mathcal{X}}\bigtimes_{i=1}^{j}\mathbf{A}_{i}\mathbf{A}_{i}^{\top}$ for $j=1,\dots,d$ and $\boldsymbol{\mathcal{X}}^{(0)}=\boldsymbol{\mathcal{X}}$ . Analogously, we define the intermediate core tensor $\boldsymbol{\mathcal{G}}^{(j)}=\boldsymbol{\mathcal{X}}\bigtimes_{i=1}^{j}\mathbf{A}_{i}^{\top}$ with $\boldsymbol{\mathcal{G}}^{(0)}=\boldsymbol{\mathcal{G}}$ . Furthermore, we choose the processing order $\rho=[1,2,\dots,d]$ for simplicity. In this case, we choose the factor matrices $\mathbf{A}_{j}$ in order to ensure that the successive iterates satisfy

[TABLE]

Applying the first part of Lemma 2.1, we obtain

[TABLE]

The details are provided in Algorithm 5, which uses a general processing order. As with R-HOSVD, we may elect to use the same error tolerance $\varepsilon/\sqrt{d}$ for each mode unfolding, or use a different tolerance $\epsilon_{j}$ for iterate $j=1,\dots,d$ .

5 Structure-preserving decompositions

In this section, we are interested in computing a low-rank decomposition in the Tucker format in which the core tensor $\boldsymbol{\mathcal{G}}\in\mathbb{R}^{r_{1}\times\dots\times r_{d}}$ has entries that are explicitly taken from the original tensor $\boldsymbol{\mathcal{X}}\in\mathbb{R}^{I_{1}\times\dots\times I_{d}}$ . That is, $\boldsymbol{\mathcal{X}}\approx\boldsymbol{\mathcal{G}}\bigtimes_{j=1}^{d}\mathbf{A}_{j}$ where $\mathbf{A}_{j}\in\mathbb{R}^{I_{j}\times r_{j}}$ with $j=1,\dots,d$ are the factor matrices (that do not necessarily have orthonormal columns). We call such a decomposition structure-preserving since the core tensor retains favorable properties (e.g., sparsity, nonnegativity, binary or integer counts) of the original tensor. This generalizes a related decomposition proposed for matrices in [9]. Related to the structure-preserving decompositions, prior work includes a higher-order interpolatory decomposition [34, 14, 31] of the form

[TABLE]

where the matrices $\{\mathbf{C}_{j}\}_{j=1}^{d}$ have entries from the original tensor (specifically, columns selected from the appropriate mode-unfoldings), and † represents the Moore-Penrose pseudoinverse.

5.1 Algorithm

We first explain our algorithm for the matrix $\mathbf{X}$ . We compute a basis $\mathbf{Q}$ using the randomized range finding algorithm. Instead of computing the low-rank approximation $\mathbf{QQ}^{\top}\mathbf{X}$ , as was done in Algorithm 1, we first identify a set of well-conditioned rows of $\mathbf{Q}$ . This is implemented as a selection operator denoted by the matrix $\mathbf{P}$ , which contains columns from the identity matrix. We then use the low-rank representation

[TABLE]

where $\mathbf{A}=\mathbf{Q}(\mathbf{P}^{\top}\mathbf{Q})^{-1}$ and $\widehat{\mathbf{X}}=\mathbf{P}^{\top}\mathbf{X}.$ We see that the matrix $\mathbf{A}$ does not have orthonormal columns, but is well-conditioned, and that the matrix $\widehat{\mathbf{X}}$ contains rows from the matrix $\mathbf{X}$ as determined by the selection operator $\mathbf{P}$ . The selection operator can be determined in a variety of ways: using the discrete empirical interpolation method [38], pivoted QR factorization [16], or strong rank-revealing QR (sRRQR) factorization [22, 17] (which is used in this paper).

The idea behind our algorithms is the following: at each step, given the core tensor $\boldsymbol{\mathcal{G}}$ we first unfold this tensor and use a randomized range finder on the unfolding to obtain a basis $\mathbf{Q}_{j}$ . Then, we use the sRRQR algorithm to determine the selection operator that identifies well-conditioned rows of $\mathbf{Q}_{j}$ to determine the core tensor for the next step. The details of the algorithm are provided in Algorithm 6. A few things are worth noting. First, the core tensor at each step contains elements from the original tensors, so that the final core tensor has entries from the original tensor; this makes the algorithm structure preserving. Second, in contrast to Algorithms 2 and 3, the factor matrices do not have orthonormal columns; if orthonormal columns are desired, a postprocessing step can be performed (a thin-QR factorization of each factor matrix to obtain the basis, followed by an aggregation step in which the core tensor is multiplied with all the triangular factors). Third, the resulting tensor is of rank- $(r_{1}+p,\dots,r_{d}+p)$ , which is also in contrast to other algorithms that produce decompositions of the rank $(r_{1},\dots,r_{d})$ . Once again, these factors can be recompressed using a postprocessing step; see, for example, [29].

Algorithm 6 is particularly beneficial for sparse tensors. Although sparse tensors can be efficiently stored in an appropriate tensor format (e.g., [2]), a straightforward application of either Algorithm 2 or Algorithm 3 produces dense intermediate tensors that may be prohibitively expensive to store, even though the final decomposition may be economical in terms of storage costs. On the other hand, in Algorithm 6, each intermediate core tensor is sparse, and only contains entries from the original tensor. Therefore, the intermediary core tensors can be efficiently stored in the same sparse tensor format. Besides the savings in memory, storing in sparse tensor format is efficient for computational reasons since tensor and matrix products are cheaper to compute.

Computational Cost

For simplicity, we assume a processing order of $\rho=[1,2,\dots,d]$ . The two dominant costs of Algorithm 6 for each mode are obtaining the basis $\mathbf{Q}_{j}$ and computing an sRRQR to determine $\mathbf{P}_{j}$ . Let $\mathsf{nnz}(\boldsymbol{\mathcal{G}}^{(j)})$ denote the number of nonzeros in the core tensor at step $j$ . Letting $\ell_{j}=r_{j}+p$ , the cost of forming the product of the (unfolded) core tensor with a random matrix $\mathbf{\Omega}_{j}$ , over all $d$ modes is $\mathcal{O}(\sum_{j=1}^{d}\mathsf{nnz}(\boldsymbol{\mathcal{G}}^{(j)})\ell_{j})$ . Computing an sRRQR of an $I_{j}\times\ell_{j}$ matrix with parameter $\eta=2$ (parameter was called $f$ in ) costs $\mathcal{O}(I_{j}\ell_{j}^{2})$ per mode. Combining both the dominant costs gives a total cost of $\mathcal{O}\left(\sum_{j=1}^{d}\mathsf{nnz}(\boldsymbol{\mathcal{G}}^{(j)})\ell_{j}+\sum_{j=1}^{d}I_{j}\ell_{j}^{2}\right).$ This analysis shows that the computational cost of SP-STHOSVD is significantly smaller than the other algorithms presented thus far, particularly with a sparse tensor. Even when the original tensor is dense, there is a savings in computational cost compared to STHOSVD since a full SVD is not computed, and compared to R-STHOSVD since the core tensor $\boldsymbol{\mathcal{G}}^{(j)}$ is only multiplied once per iteration. For computational reasons, we still use the processing order in Section 3.3.

5.2 Error Analysis

We now present the error analysis for Algorithm 6. There are two major difficulties here in extending the proofs of Theorems 2.2 and 2.3. First, we have to work with an oblique projector $\mathbf{\Pi}_{j}=\mathbf{Q}_{j}(\mathbf{P}_{j}^{\top}\mathbf{Q}_{j})^{-1}\mathbf{P}_{j}^{\top}$ , whereas in the previous analysis we used an orthogonal projector. As a consequence, we can no longer use the Pythagoras theorem to obtain Lemma 2.1; instead we have to employ the triangle inequality, resulting in a weaker bound. Second, we have to work with the factor matrices $\mathbf{A}_{j}$ which no longer have orthonormal columns. We are able to show the following error bound.

Theorem 5.1.

Let $\widehat{\boldsymbol{\mathcal{X}}}=[\boldsymbol{\mathcal{G}};\mathbf{A}_{1},\dots,\mathbf{A}_{d}]$ be the output of Algorithm 6 with inputs target rank $\mathbf{r}=(r_{1},\dots,r_{d})$ and oversampling parameter $p\geq 2$ such that $p$ satisfies $r_{j}+p<\min\{I_{j},\prod_{i\neq j}I_{i}\}$ for $j=1,\dots,d$ . Furthermore, assume a processing order of $\rho=[1,2,\dots,d]$ , and let $\eta=2$ be the sRRQR parameter. Then, the expected approximation error is

[TABLE]

where $g(I,r)=\sqrt{1+4r(I-r)}$ , and $f_{p}(r)=\sqrt{1+\frac{r}{p-1}}$ . Furthermore, the matrices $\{\mathbf{A}_{j}\}_{j=1}^{d}$ each contain an $\ell_{j}\times\ell_{j}$ identity matrix and

[TABLE]

Proof 5.2.

First, consider the factor matrices $\mathbf{A}_{j}=\mathbf{Q}_{j}(\mathbf{P}_{j}^{\top}\mathbf{Q}_{j})^{-1}$ for $j=1,\dots,d$ . Since $\mathbf{P}_{j}$ contains columns from the identity matrix, it is easy to verify that $\mathbf{A}_{j}$ contains the identity matrix as its submatrix. The lower bound on $\|\mathbf{A}_{j}\|_{2}$ follows immediately from this fact. For the upper bound, since $\mathbf{Q}_{j}$ has orthonormal columns almost surely,

[TABLE]

The last step follows from [17, Lemma 2.1], where $\eta=2$ is used as the tuning parameter for the sRRQR algorithm.

Next, let $\boldsymbol{\mathcal{G}}^{(j)}=\boldsymbol{\mathcal{X}}\bigtimes_{i=1}^{j}\mathbf{P}_{i}^{\top}$ denote the partially truncated core tensor after the $j$ -th step of Algorithm 6. Also let $\widehat{\boldsymbol{\mathcal{X}}}^{(j)}=\boldsymbol{\mathcal{G}}^{(j)}\bigtimes_{i=1}^{j}\mathbf{A}_{i}$ be the $j$ -th partial approximation of $\boldsymbol{\mathcal{X}}$ . Then, by the triangle inequality and the linearity of expectations, we have

[TABLE]

Consider the term $\|\boldsymbol{\mathcal{\hat{X}}}^{(j-1)}-\boldsymbol{\mathcal{\hat{X}}}^{(j)}\|_{F}.$ We can simplify $\boldsymbol{\mathcal{\hat{X}}}^{(j)}$ as

[TABLE]

Therefore, $\boldsymbol{\mathcal{\hat{X}}}^{(j-1)}-\boldsymbol{\mathcal{\hat{X}}}^{(j)}=\boldsymbol{\mathcal{G}}^{(j-1)}\bigtimes_{i=1}^{j-1}\mathbf{A}_{i}\times_{j}(\mathbf{I}-\mathbf{\Pi}_{j})$ . Repeated use of the submultiplicativity $\|\mathbf{MN}\|_{F}\leq\|\mathbf{M}\|_{F}\|\mathbf{N}\|_{2}$ ,

[TABLE]

Now, observe that $\mathbf{\Pi}_{j}\mathbf{Q}_{j}\mathbf{Q}_{j}^{\top}=\mathbf{Q}_{j}\mathbf{Q}_{j}^{\top}$ , implying that $\mathbf{I}-\mathbf{\Pi}_{j}=(\mathbf{I}-\mathbf{\Pi}_{j})(\mathbf{I}-\mathbf{Q}_{j}\mathbf{Q}_{j}^{\top})$ . Therefore, once again using submultiplicativity

[TABLE]

We note that $\mathbf{\Pi}_{j}\neq\mathbf{I}$ since $\mathsf{rank}(\mathbf{\Pi}_{j})\leq\mathsf{rank}(\mathbf{Q}_{j})=r_{j}+p<I_{j}$ , and $\mathbf{\Pi}_{j}\neq\mathbf{0}$ since $\mathbf{Q}_{j}$ has orthonormal columns (almost surely) and $\mathbf{P}_{j}^{\top}\mathbf{Q}_{j}$ is invertible. Therefore, using [39] $\|\mathbf{I}-\mathbf{\Pi}_{j}\|_{2}=\|\mathbf{\Pi}_{j}\|_{2}$ . Once again using [17, Lemma 2.1]

[TABLE]

Combining this inequality with Eqs. 15, 18 and 17, we have

[TABLE]

Taking expectations, and using the independence of the random matrices, we obtain

[TABLE]

In the last step, we have used Eq. 5 and kept the random matrices $\{\mathbf{\Omega}_{k}\}_{i=1}^{j-1}$ fixed. By construction $\mathbf{G}_{(j)}^{(j-1)}$ is a submatrix of the mode-unfolding $\mathbf{X}_{(j)}$ . Arguing as in the proof of Theorem 3.3, we can show $\sum_{i=r_{j}+1}^{I_{j}}\sigma_{i}^{2}(\mathbf{G}_{(j)}^{(j-1)})\leq\Delta_{j}^{2}(\boldsymbol{\mathcal{X}})$ , for $j=1,\dots,d$ and, therefore,

[TABLE]

Plugging this into Eq. 16, and using Eq. 4 we get the desired bound.

In this result, we have assumed the standard processing order $\rho=[1,\dots,d]$ . The analysis can be extended to other processing orders, but we omit a detailed statement here. However, note that the upper bound derived here is dependent on the processing order, which is in contrast to the bound in Theorem 3.3. Although the error bound in Theorem 5.1 can be much higher than Theorem 3.3, numerical results suggest that the bounds are somewhat pessimistic and, in practice, Algorithm 6 produces accurate decompositions. Two other features are worth mentioning. First, the core tensor $\boldsymbol{\mathcal{G}}$ contains $\prod_{j=1}^{d}\ell_{j}$ entries from the original tensor $\boldsymbol{\mathcal{X}}$ . Second, while each factor matrix may not be orthonormal, they are close to orthonormal and, in fact, explicitly contain the identity matrix as a submatrix.

5.3 Variants

One can easily develop several variations of the Algorithm 6 that may be beneficial in applications.

1. SP-HOSVD.

In Algorithm 6, we handled all the modes sequentially; they can however be handled independently. For each mode $j=1,\dots,d$ , we obtain a basis $\mathbf{Q}_{j}$ and a selection operator $\mathbf{P}_{j}$ . The resulting decomposition is of the form $\boldsymbol{\mathcal{X}}\approx\widehat{\boldsymbol{\mathcal{X}}}=\boldsymbol{\mathcal{G}}\bigtimes_{j=1}^{d}\mathbf{A}_{j}$ where the core tensor $\boldsymbol{\mathcal{G}}=\boldsymbol{\mathcal{X}}\bigtimes_{j=1}^{d}\mathbf{P}_{j}^{\top}$ and the factor matrices are of the form $\mathbf{A}_{j}=\mathbf{Q}_{j}(\mathbf{P}_{j}^{\top}\mathbf{Q}_{j})^{-1}$ for $j=1,\dots,d$ . The error analysis is similar to Theorem 5.1 and we found that it has the same upper bound.

2. Range finding.

We used a basic version of the randomized range finding algorithm to obtain the matrices $\mathbf{Q}_{j}$ . Other variations are certainly possible; for example, the adaptive range finding algorithm Section 4, randomized subspace iteration [24, Algorithm 4.3], or other deterministic or randomized rank-revealing decompositions.

3. Subset selection.

In Algorithm 6, we used the sRRQR algorithm for the subset selection step. In practice, this is computationally expensive, and an alternative is to use the Column-Pivoted QR factorization. This algorithm has lower computational cost but is known to fail for certain adversarial cases [22]. Besides deterministic algorithms for subset selection, there are several randomized techniques available, such as uniform sampling and leverage score sampling that can be used instead of sRRQR.

6 Numerical Results

In this section, we study the accuracy and the computational cost of our algorithms on several synthetic and real-world tensors. The main result we wish to verify here is that randomizing the preexisting compression algorithms decreases the computational time while not significantly increasing the error. All results were run on a desktop with a $3.4$ GHz Intel Core i7 processing unit and 16GB memory. We used two tensor packages in matlab, namely Tensor Toolbox [3] for handling sparse tensors, and Tensorlab [44] for everything else.

6.1 Test Problems

We briefly describe the four different sources of tensors that we use to validate our algorithms.

1. Hilbert Tensor

Our first test tensor is a synthetic, super-symmetric tensor (invariant under the permutation of indices), where each entry is defined as

[TABLE]

We call this the Hilbert tensor, which generalizes the Hilbert matrix ( $d=2$ ). When $d=5$ and each $I_{j}=25$ , this tensor has $25^{5}=9,765,625$ nonzero entries. The structure of this tensor is such that the singular values of each mode-unfolding decay rapidly, which indicates that the randomized algorithms proposed in this paper are likely to be accurate.

2. Synthetic Sparse tensor

For our second example, we construct a three-dimensional sparse tensor $\boldsymbol{\mathcal{X}}\in\mathbb{R}^{200\times 200\times 200}$ as the sum of outer products as

[TABLE]

where $\mathbf{x}_{i},\mathbf{y}_{i},\mathbf{z}_{i}\in\mathbb{R}^{n}$ are sparse vectors for all $i$ , and $\circ$ denotes the outer product. The sparse vectors are all generated using the prand command with $5\%$ nonzero. This results in the tensor $\boldsymbol{\mathcal{X}}$ having $185,211$ nonzeros total. Furthermore, $\gamma$ is a user-defined parameter which determines the strength of the gap between the first ten terms and the last terms.

3. Olivetti Dataset

The classification of facial images, or “tensorfaces” as popularized by [42], has two main steps. The first is the compression phase, where a higher order SVD is applied to the tensor for decomposition. The second step is the classification process, in which the decomposed tensor is used to classify new images. We focus on the first step to efficiently decompose the tensor of images from the Olivetti dataset [1] using the proposed randomized algorithms. This dataset contains $400$ images ( $64\times 64$ pixels) of $40$ people in $10$ different poses. This set of images can be expressed as a three dimensional tensor $\boldsymbol{\mathcal{X}}\in\mathbb{R}^{40\times 4096\times 10}$ , in which the three modes represent people, pixels, and poses, respectively.

4. FROSTT database

Our final test problems come from the formidable repository of sparse tensors and tools (FROSTT) database [37]. From this database, we choose two representative large, sparse tensors whose features are summarized in Table 2.

The NELL-2 dataset [7] is a portion of the Never Ending Language Learning knowledge base from the “Read the Web” project at Carnegie Mellon University. NELL is a machine learning system that relates different entities, creating a three-dimensional dataset whose modes represent entity, relation, and entity. The Enron dataset [36] contains word counts in emails released during an investigation by the Federal Energy Regulatory Commission. Here, the modes represent sender, receiver, word, and date, respectively.

Although our implementation of SP-STHOSVD is capable of handling both full tensors in Table 2, they are unfortunately too large to compute the approximation error. Then, in order to compute this error, we subsample the tensors first. This also allows us to compare the performance of our SP-STHOSVD algorithm with others that could not handle the size of the full datasets. For the NELL-2 dataset [7], we subsample every 15 elements to obtain a tensor $\boldsymbol{\mathcal{X}}\in\mathbb{R}^{807\times 613\times 1922}$ . For the Enron dataset [36], we also need to subsample, but even with subsampling, the tensor is still too sparse to accurately compute the approximation error. To address this, we also condense the dataset to three dimensions by summing over the fourth mode. Then to subsample, we take every 15 elements from the first two modes, and every 25 from the third. This results in a tensor $\boldsymbol{\mathcal{X}}\in\mathbb{R}^{405\times 380\times 9771}$ .

6.2 Numerical Experiments

We now describe the experiments performed on the test tensors introduced in the previous subsection.

6.2.1 Fixed rank

Our first experiment compares the accuracy of the HOSVD and STHOSVD algorithms with their randomized counterparts, R-HOSVD and R-STHOSVD (Algorithms 2 and 3). As inputs, we take $\boldsymbol{\mathcal{X}}$ as defined in Eq. 19 with $d=5$ modes and $I_{j}=25$ for $j=1,\dots,d$ . For each algorithm, we use the target rank $(r,r,r,r,r)$ , where $r$ varies from $1$ to $25$ , and the same oversampling parameter $p=5$ was used in every mode. Since this is a super-symmetric tensor, the processing order of modes does not affect the results, so we take the processing order $\rho=[1,2,3,4,5]$ . The relative error is plotted in Fig. 1, where we can see that the approximation error of all four algorithms is very similar and that the randomized algorithms are highly accurate.

The comparison of the cost in Section 3.3 implies that the proposed randomized algorithms are less expensive. To illustrate this, we report the runtime of the HOSVD, R-HOSVD, STHOSVD, and R-STHOSVD algorithms on $\boldsymbol{\mathcal{X}}$ as the size of each dimension increases. For inputs, we fixed the target rank to be $(5,5,5,5,5)$ , the oversampling parameter as $p=5$ , and we used processing order $\rho=[1,2,3,4,5]$ in the sequential algorithms. The runtime in seconds, averaged over three runs, is shown in Table 3. The theory implies that the randomized algorithms should be a factor of $n/(r+p)=2.5$ faster than the non-randomized algorithms. This is evident in our results. Also, the sequential algorithms are significantly faster than the HOSVD/R-HOSVD algorithms.

We now compare the computed results to the theoretical bounds shown in Eqs. 7 and 9 for the R-HOSVD and R-STHOSVD algorithms. Since the upper bound for both algorithms is the same, we display this bound only once. Running both algorithms with target rank $(r,r,r,r,r)$ as $r$ increases, oversampling parameter $p=5$ , and processing order $\rho=[1,2,3,4,5]$ for R-STHOSVD, we can see in the right-hand plot of Fig. 1 that they are similar to each other and the computed bound. This closeness shows that the theoretical bounds capture the actual error well.

Next, we apply the randomized algorithms to the Olivetti dataset. The results are described in Fig. 2. As all the modes have different dimensions, we only compress the largest (the pixels) to start. The first plot shows the standard algorithms, but there is a noticeable difference in the error for the randomized and standard algorithms. This can be explained by the fact that the singular values of the data do not decay sufficiently quickly. To fix this, we add one step of subspace iteration (see [24] for details) to the randomized SVD, giving the second plot. Here the difference is almost nonexistent.

Now we consider the relative error if compressing two modes of the Olivetti dataset, namely the people and pixels (modes 1 and 2). We just compare the sequential algorithms here, as they are faster to run. The heat plots in Fig. 3 show the results. We also compare this rank and error pair to that from the adaptive randomized STHOSVD algorithm (Algorithm 5). For all algorithms, the oversampling parameter was $p=5$ , and the processing order was $\rho=[2,1,*]$ . The third mode is not compressed, indicated here by the asterisk. We will elaborate on the adaptive algorithms in the next section.

6.2.2 Adaptive algorithms

We now evaluate our adaptive Algorithms 4 and 5 to compare the ranks given by inputing different relative error tolerances as the size of each dimension increases. Taking the Hilbert tensor $\boldsymbol{\mathcal{X}}$ defined in Eq. 19 with $d=3$ , we give as input relative error tolerance $\epsilon$ , processing order $\rho=[1,2,3]$ , and blocking integer $b=1$ . The size of the core tensor obtained from the adaptive STHOSVD algorithm is shown in Table 4. In the $I_{j}=25,\,j=1,2,3$ case, we can see that the error and corresponding rank are close to those shown in the left-hand plot of Fig. 1.

Next, we consider the Olivetti dataset. To compare our randomized algorithms to the standard algorithms, we give a desired relative error tolerance $\epsilon$ to the adaptive R-STHOSVD and find the size of the core tensor. We then compare the actual error from that rank to the error from STHOSVD run with the same rank, and display the results in Table 5. The theory implies that we should see a smaller error from the STHOSVD results, and we observe that the error is only slightly smaller. Both algorithms were run with processing order $\rho=[2,1,3]$ .

6.2.3 Algorithms for Sparse Tensors

We now test our algorithms on sparse tensors. First consider the synthetic sparse tensor $\boldsymbol{\mathcal{X}}$ defined in Eq. 20. For three different $\gamma$ values $\gamma=2,10,200$ , we compare the SP-STHOSVD algorithm to the STHOSVD and R-STHOSVD algorithms by plotting the relative error as the target rank $(r,r,r)$ increases. Note that we are only comparing to the sequential algorithms in Fig. 4. This is because we have already shown the HOSVD and R-HOSVD algorithms to have a similar error with a higher cost in a problem of this size. As inputs to our test algorithms, we used oversampling parameter $p=5$ and processing order $\rho=[1,2,3]$ . We can see that the error for the sparse algorithm is slightly higher for the lower $r$ values, which is better than the expected result given the error bound in Theorem 5.1.

We now test our SP-STHOSVD algorithm on the real-world sparse tensors. Note that this algorithm oversamples but does not truncate, meaning that the rank of the resulting approximation given target rank $(r,r,r)$ will be $(r+p,r+p,r+p)$ . Then to compare the approximations of SP-STHOSVD to any of the other fixed rank algorithms presented thus far, we must use $(r+p,r+p,r+p)$ as the target rank for the fixed rank algorithms.

To compare our algorithms, we ran the SP-STHOSVD and the R-STHOSVD algorithms on the condensed NELL-2 tensor (details in Table 2) with inputs oversampling parameter $p=5$ , and processing order $\rho=[3,1,2]$ as the target rank $(r,r,r)$ increased. The relative errors obtained are shown in Table 6. We can see that the error for SP-STHOSVD is higher than that of the R-STHOSVD, but this was anticipated from the theory. We also compare the runtime of these two algorithms averaged over three runs to see their respective costs, which are also shown in Table 6.

For the Enron dataset [36], we repeat the previous experiment, comparing the relative error and runtime of the SR-STHOSVD and R-STHOSVD as the target rank increases. The results are in Table 6. We see similar results as before, in that the error for SP-STHOSVD is higher than that of the R-STHOSVD, but the runtime is far less.

7 Conclusion

In this paper, we makes several contributions in terms of new randomized algorithms and analysis for low-rank tensor decomposition in the Tucker format. Specifically, we proposed adaptive algorithms for problems where the target rank is not known beforehand and an algorithm that preserves the structure of the original tensor. We also provided probabilistic analysis of randomized compression algorithms, R-HOSVD and R-STHOSVD, as well as analysis for the newly proposed algorithms. We showed, through the analysis and numerical examples that using randomized techniques still allows for accurate approximations to tensors, and that the approximation error is comparable to deterministic algorithms, with much lower computational costs.

8 Acknowledgements

The authors would like to thank Ilse Ipsen for reading through the paper and giving useful feedback.

Bibliography47

The reference list from the paper itself. Each links out to its DOI / PubMed record.

1[1] AT&T Laboratories at Cambridge. Olivetti database of faces, 2002.
2[2] B. W. Bader and T. G. Kolda. Efficient MATLAB computations with sparse and factored tensors. SIAM Journal on Scientific Computing , 30(1):205–231, December 2007.
3[3] B. W. Bader and T. G. Kolda. Efficient MATLAB computations with sparse and factored tensors. SIAM Journal on Scientific Computing , 30(1):205–231, Dec. 2007.
4[4] K. Batselier, W. Yu, L. Daniel, and N. Wong. Computing low-rank approximations of large-scale matrices with the tensor network randomized SVD. SIAM Journal on Matrix Analysis and Applications , 39(3):1221–1244, 2018.
5[5] C. Battaglino, G. Ballard, and T. G. Kolda. A practical randomized CP tensor decomposition. SIAM Journal on Matrix Analysis and Applications , 39(2):876–901, 2018.
6[6] D. J. Biagioni, D. Beylkin, and G. Beylkin. Randomized interpolative decomposition of separated representations. Journal of Computational Physics , 281:116–134, 2015.
7[7] A. Carlson, J. Betteridge, B. Kisiel, B. Settles, E. R. Hruschka Jr., and T. M. Mitchell. Toward an architecture for never-ending language learning. In AAAI , volume 5, page 3, 2010.
8[8] M. Che and Y. Wei. Randomized algorithms for the approximations of Tucker and the Tensor Train decompositions. Advances in Computational Mathematics , pages 1–34, 2018.

TL;DR

Contribution

Findings

Abstract

Peer Reviews

Videos

Randomized Algorithms for Low-rank Tensor Decompositions in the Tucker Format††thanks: \fundingThe authors would like to acknowledge the support of the National Science Foundation through the grants DMS-1821148 (M.E.K) and DMS-1821149 (R.M. and A.K.S).

Abstract

1 Introduction

Contributions and Contents

Related Work

2 Background

2.1 Notation and preliminaries

Tensor product

Lemma 2.1**.**

Tucker representation

Kronecker products

2.2 HOSVD/STHOSVD

HOSVD

Theorem 2.2**.**

STHOSVD

Theorem 2.3**.**

2.3 Best Approximation

2.4 Randomized SVD

Theorem 2.4**.**

Remark 2.5**.**

3 Randomized HOSVD/STHOSVD

3.1 Algorithms

3.2 Error Analysis

Theorem 3.1** (Randomized HOSVD).**

Proof 3.2**.**

Theorem 3.3** (Randomized STHOSVD).**

Proof 3.4**.**

Extensions

3.3 Computational Cost

Processing order

4 Adaptive Randomized Tensor Decompositions

Adaptive R-HOSVD

Adaptive R-STHOSVD

5 Structure-preserving decompositions

5.1 Algorithm

Computational Cost

5.2 Error Analysis

Theorem 5.1**.**

Proof 5.2**.**

5.3 Variants

6 Numerical Results

6.1 Test Problems

1. Hilbert Tensor

2. Synthetic Sparse tensor

3. Olivetti Dataset

4. FROSTT database

6.2 Numerical Experiments

6.2.1 Fixed rank

6.2.2 Adaptive algorithms

6.2.3 Algorithms for Sparse Tensors

7 Conclusion

8 Acknowledgements

Lemma 2.1.

Theorem 2.2.

Theorem 2.3.

Theorem 2.4.

Remark 2.5.

Theorem 3.1 (Randomized HOSVD).

Proof 3.2.

Theorem 3.3 (Randomized STHOSVD).

Proof 3.4.

Theorem 5.1.

Proof 5.2.