A Quantum-inspired Classical Algorithm for Separable Non-negative Matrix Factorization

Zhihuai Chen; Yinan Li; Xiaoming Sun; Pei Yuan; Jialin Zhang

arXiv:1907.05568·cs.DS·July 15, 2019

TL;DR

This paper introduces a new classical algorithm for separable Non-negative Matrix Factorization that is inspired by quantum techniques, achieving exponential speedup for large-scale, low-rank datasets.

Contribution

It presents a polynomial-time classical algorithm for separable NMF inspired by quantum dequantization methods, enabling efficient processing of large datasets.

Findings

01. Runs in time polynomial in the rank and logarithmic in the input size

02. Achieves an exponential speedup in low-rank scenarios

03. Applicable to large-scale text and image data

Abstract

Non-negative Matrix Factorization (NMF) asks to decompose an (entry-wise) non-negative matrix into the product of two smaller non-negative matrices, a problem that has been shown to be intractable in general. To overcome this issue, the separability assumption is introduced, which assumes that all data points lie in the conical hull of a small subset of them. This assumption makes NMF tractable and is widely used in text analysis and image processing, but existing algorithms remain impractical for huge-scale datasets. In this paper, inspired by recent developments in dequantization techniques, we propose a new classical algorithm for the separable NMF problem. Our algorithm runs in time polynomial in the rank and logarithmic in the size of the input matrices, which is an exponential speedup in the low-rank setting.

Equations

$$\min_{W\in\mathbb{R}_{\geq 0}^{m\times k},\;H\in\mathbb{R}_{\geq 0}^{n\times k}}\|A-WH^{T}\|_{F}.$$

$$\|A\hat{V}\hat{V}^{T}-A\|_{F}^{2}\leq\min_{D:\,\mathrm{rank}(D)\leq k}\|A-D\|_{F}^{2}+\epsilon\|A\|_{F}^{2}.$$

$$\|A\hat{V}\hat{V}^{T}-A\|_{F}\leq\sqrt{\epsilon}\,\|A\|_{F}.$$

$$\|V-\hat{V}\|_{F}\leq\alpha/\sqrt{2}+c_{1}\alpha^{2},$$

$$\|\Pi_{\hat{V}}-\hat{V}\hat{V}^{T}\|_{F}\leq c_{2}\alpha,$$

$$O\left(\mathrm{poly}\left(k,\kappa,\log\frac{1}{\delta},\log(mn)\right)\right).$$

$$\mathrm{span}\{\hat{V}^{(i)}\mid i\in[l],\,l\leq k\}=\mathrm{span}\{A_{(i)}\mid i\in[m]\}.$$

$$\|Ax-A\hat{V}\hat{V}^{T}x\|\leq\sqrt{\epsilon}\,\|A\|_{F}\leq\sqrt{\epsilon k}\,\kappa\,\sigma_{\min}(A).$$

$$\|\mathcal{D}_{AVx},\mathcal{O}\|_{TV}$$

$$\|M-\tilde{M}\|_{F}\leq\zeta\|A\|_{F}\|L\|_{F}\|R\|_{F}$$

$$\|M-\tilde{M}\|_{F}^{2}\leq\sum_{i\in[k_{1}],\,j\in[k_{2}]}\zeta^{2}\|A\|_{F}^{2}\|L_{(i)}\|^{2}\|R^{(j)}\|^{2}=\zeta^{2}\|A\|_{F}^{2}\sum_{i\in[k_{1}]}\|L_{(i)}\|^{2}\sum_{j\in[k_{2}]}\|R^{(j)}\|^{2}=\zeta^{2}\|A\|_{F}^{2}\|L\|_{F}^{2}\|R\|_{F}^{2}.$$

$$\|AVx-A\hat{V}x\|\leq\|A\|_{F}\|V-\hat{V}\|_{F}\|x\|\leq(\alpha_{V}/\sqrt{2}+c_{1}\alpha_{V}^{2})\|A\|_{F}.$$

$$\|AVx-A\hat{V}x\|\leq(\alpha_{V}/\sqrt{2}+c_{1}\alpha_{V}^{2})\sqrt{k}\,\kappa\,\|AVx\|.$$

$$\|A\hat{V}x-\hat{U}\hat{U}^{T}A\hat{V}x\|=\cdots\leq\cdots$$

$$\|\hat{U}\hat{U}^{T}A\hat{V}x\|\geq\left(1-\frac{\epsilon}{8}\right)\|A\hat{V}x\|\geq\left(1-\frac{\epsilon}{8}\right)^{2}\|AVx\|.$$

$$\|\hat{U}^{T}A\hat{V}-\tilde{M}\|_{F}\leq\zeta\|\hat{U}\|_{F}\|\hat{V}\|_{F}\|A\|_{F}\leq\zeta k^{1.5}\kappa\left(1+\frac{\alpha_{U}}{k}\right)\left(1+\frac{\alpha_{V}}{k}\right)\|AVx\|.$$

$$\|\hat{U}\hat{U}^{T}A\hat{V}x-\hat{U}\tilde{M}x\|\leq\cdots\leq\cdots\leq\cdots$$

$$\|\mathcal{D}_{\hat{U}\tilde{y}},\mathcal{O}\|_{TV}\leq\frac{\alpha_{U}}{1-\alpha_{U}}<\frac{\epsilon}{4}.$$

$$\arg\max_{i}\{N_{i}\mid 1\leq i\leq N\}=\arg\max_{i}\{p_{i}\mid 1\leq i\leq m\}.$$

$$p_{i}^{*}=\Pr_{\beta}\left(i=\arg\max_{i}\{(A\beta)_{i}\}\right).$$

Full text

1 CAS Key Lab of Network Data Science and Technology, Institute of Computing Technology, Chinese Academy of Sciences, 100190, Beijing, China

2 University of Chinese Academy of Sciences, 100049, Beijing, China

3 Centrum Wiskunde & Informatica and QuSoft, Science Park 123, 1098XG Amsterdam, Netherlands

{chenzhihuai,sunxiaoming,yuanpei}@ict.ac.cn, [email protected]

Zhihuai Chen (1,2), Yinan Li (3), Xiaoming Sun (1,2), Pei Yuan (1,2), Jialin Zhang (1,2)

1 Introduction

Non-negative Matrix Factorization (NMF) aims to approximate a non-negative data matrix $A\in\mathbb{R}_{\geq 0}^{m\times n}$ by the product of two non-negative low-rank factors, i.e., $A\approx WH^{T}$, where $W\in\mathbb{R}_{\geq 0}^{m\times k}$ is called the basis matrix, $H\in\mathbb{R}_{\geq 0}^{n\times k}$ is called the encoding matrix, and $k\ll\min\{m,n\}$. In many applications, NMF yields a more natural and interpretable parts-based decomposition of the data [LS99]. NMF has therefore been widely used in practice, for example in topic modeling of text, signal separation, social networks, collaborative filtering, dimension reduction, sparse coding, feature selection and hyperspectral image analysis. Since computing an NMF is NP-hard [Vav09], a series of heuristic algorithms have been proposed [LS01, Lin07, HD11, KP08, DLJ10, GTLST12]. All of these heuristics aim to minimize the reconstruction error, formulated as the following non-convex program, and therefore lack optimality guarantees:

$$\min_{W\in\mathbb{R}_{\geq 0}^{m\times k},\;H\in\mathbb{R}_{\geq 0}^{n\times k}}\|A-WH^{T}\|_{F}.$$

A natural assumption on the data, called the separability assumption, was observed in [DS04]. From a geometric perspective, separability means that all rows of $A$ reside in a cone generated by a much smaller number of rows. These generators are called the anchors of $A$. To solve Separable Non-negative Matrix Factorization (SNMF), it suffices to identify the anchors of the input matrix, which can be done in polynomial time [AGKM12, AGM12, GV14, EMO+12, ESV12, ZBT13, ZBG14]. The separability assumption is natural in various practical applications. For example, in the unmixing task in hyperspectral imaging, separability implies the existence of ‘pure’ pixels [GV14]. In topic detection, it means that some words are associated with a unique topic [Hof17]. In huge datasets, it is useful to pick a few representative data points to stand for the others. Such a ‘self-expression’ assumption helps to improve the data analysis procedure [MD09, EV09].

1.1 Related work

It is natural to assume that all rows of the input $A$ have unit $\ell_{1}$-norm, since $\ell_{1}$-normalization maps the conical hull to a convex hull while keeping the anchors unchanged. From this perspective, most algorithms essentially identify the extreme points of the convex hull of the ($\ell_{1}$-normalized) data vectors. In [AGKM12], the authors use $m$ linear programs in $O(m)$ variables to identify the anchors among $m$ data points, which is not suitable for large-scale real-world problems. Furthermore, [RRTB12] presents a single LP in $n^{2}$ variables for SNMF that handles large-scale problems, but it is still impractical for huge-scale problems.

Another class of algorithms is based on greedy pursuit. The main idea is to pick, in each step, a data point in the direction along which the current residual decreases fastest. The algorithms terminate when the error is sufficiently small or after a prescribed number of iterations. For example, the Successive Projection Algorithm (SPA) [GV14] derives from Gram-Schmidt orthogonalization with row or column pivoting. XRAY [KSK13] detects a new anchor by referring to the residual of exterior data points and updates the residual matrix by solving a non-negative least squares regression. Both of these greedy algorithms have smaller time complexity than LP-based methods. However, their time complexity is still too large for large-scale data.

[ZBT13, ZBG14] use a Divide-and-Conquer Anchoring (DCA) framework to tackle SNMF: the data set is projected onto several low-dimensional subspaces, and each projection determines a small set of anchors. Moreover, it can be proven that all $k$ anchors can be identified with $O(k\log k)$ projections.

Recently, a quantum algorithm for SNMF, called the Quantum Divide-and-Conquer Anchoring algorithm (QDCA), has been presented [DLL+18]. It uses quantum techniques to speed up the random projection step in [ZBT13]. QDCA implements the matrix-vector product (i.e., the random projection) via quantum principal component analysis, after which a quantum state encoding the projected data points can be prepared efficiently. Moreover, several papers use dequantization techniques to solve low-rank matrix problems, such as recommendation systems [Tan18] and matrix inversion [GLT18, CLW18]. The dequantization techniques in those algorithms combine two tools, Monte-Carlo singular value decomposition and rejection sampling, which can efficiently simulate certain operations on low-rank matrices.

Inspired by QDCA and the dequantization techniques, we propose a classical randomized algorithm that speeds up the random projection step in [ZBT13] and thereby identifies all anchors efficiently. Our algorithm takes time polynomial in the rank $k$, the condition number $\kappa$ and the logarithm of the size of the matrix. When the rank is $k=O(\log(mn))$, our algorithm achieves an exponential speedup over other classical algorithms for SNMF.

1.2 Organization

The rest of this paper is organized as follows. In Section 2, we introduce the notation, models and preliminaries of our algorithm. In Section 3, we present our algorithm and analyze its correctness and running time. Section 4 concludes with a discussion of this paper and future work.

2 Preliminaries

2.1 Notations

Let $[n]:=\{1,2,\ldots,n\}$. Let $\mathrm{span}\{x_{i}\in\mathbb{R}^{n}\mid i\in[k]\}:=\{\sum_{i=1}^{k}\alpha_{i}x_{i}\mid\alpha_{i}\in\mathbb{R},i\in[k]\}$ denote the space spanned by $x_{i}$ for $i\in[k]$. For a matrix $A\in\mathbb{R}^{m\times n}$, $A_{(i)}$ and $A^{(j)}$ denote the $i$th row and the $j$th column of $A$ for $i\in[m]$, $j\in[n]$, respectively. Let $A_{R}=[A^{T}_{(i_{1})},A^{T}_{(i_{2})},\ldots,A^{T}_{(i_{r})}]^{T}$, where $A\in\mathbb{R}^{m\times n}$ and $R=\{i_{1},i_{2},\ldots,i_{r}\}\subseteq[m]$ (without loss of generality, assume $i_{1}\leq i_{2}\leq\cdots\leq i_{r}$). $\|A\|_{F}$ and $\|A\|_{2}$ refer to the Frobenius norm and the spectral norm, respectively. For a vector $v\in\mathbb{R}^{n}$, $\|v\|$ denotes its $\ell_{2}$-norm. For two probability distributions $p,q$ (as density functions) over a discrete universe $D$, the total variation distance between them is defined as $\|p,q\|_{TV}:=\frac{1}{2}\sum_{i\in D}|p(i)-q(i)|$. $\kappa(A):=\sigma_{\max}/\sigma_{\min}$ denotes the condition number of $A$, where $\sigma_{\max}$ and $\sigma_{\min}$ are the maximal and minimal non-zero singular values of $A$.

2.2 Sample Model

In the query model, algorithms for the SNMF problem require time at least linear in the number of nonzero entries of the matrix, since in the worst case they have to read all entries. However, we expect our algorithm to be efficient even if the datasets are extremely large. One advantage of QDCA [DLL+18] is that the data is prepared as a quantum state and can be accessed in a ‘quantum’ way (like sampling). Thus, in the quantum algorithm, a quantum state represents the data implicitly, and the data can be read out only by measurement. To avoid reading the whole matrix, we introduce a sample model, distinct from the query model, based on the idea of the quantum state preparation assumption.

Definition 1 ( $\ell_{2}$ -norm Sampling)

Let ${\mathcal{D}}_{v}$ denote the distribution over $[n]$ with density function ${\mathcal{D}}_{v}(i)=v_{i}^{2}/\mathinner{\!\left\lVert v\right\rVert}^{2}$ for $v\in\mathbb{R}^{n}$ . A sample from a distribution ${\mathcal{D}}_{v}$ is called a sample from $v$ .

Lemma 1 (Vector Sample Model)

There is a data structure storing a vector $v\in\mathbb{R}^{n}$ in $O(n\log n)$ space and supporting the following operations:

• Querying and updating an entry in $O(\log n)$ time;

• Sampling from $\mathcal{D}_{v}$ in $O(\log n)$ time;

• Finding $\|v\|$ in $O(1)$ time.

Such a data structure can be easily implemented via a Binary Search Tree (BST) (see Figure 1).
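The three operations of Lemma 1 can be illustrated with a short sketch. It assumes a complete binary tree stored in an array (each internal node holds the sum of squared entries below it) rather than a pointer-based BST; the class name `SampleVector` and its method names are hypothetical and not taken from the paper.

```python
import random


class SampleVector:
    """Array-backed binary tree over squared entries (illustrative, not the paper's code)."""

    def __init__(self, values):
        self.n = len(values)
        self.size = 1
        while self.size < self.n:
            self.size *= 2
        self.values = list(values)
        # tree[1] is the root; leaf i lives at index size + i and stores v[i]^2.
        self.tree = [0.0] * (2 * self.size)
        for i, x in enumerate(values):
            self.tree[self.size + i] = x * x
        for node in range(self.size - 1, 0, -1):
            self.tree[node] = self.tree[2 * node] + self.tree[2 * node + 1]

    def query(self, i):
        return self.values[i]

    def update(self, i, x):
        """Set v[i] = x and refresh the O(log n) partial sums on the path to the root."""
        self.values[i] = x
        node = self.size + i
        self.tree[node] = x * x
        node //= 2
        while node >= 1:
            self.tree[node] = self.tree[2 * node] + self.tree[2 * node + 1]
            node //= 2

    def norm(self):
        """l2-norm of v, read off the root in O(1)."""
        return self.tree[1] ** 0.5

    def sample(self, rng=random):
        """Draw index i with probability v[i]^2 / ||v||^2 by walking down from the root."""
        node = 1
        r = rng.random() * self.tree[1]
        while node < self.size:
            left = self.tree[2 * node]
            if r < left:
                node = 2 * node
            else:
                r -= left
                node = 2 * node + 1
        # Guard against landing on a padding leaf through floating-point rounding.
        return min(node - self.size, self.n - 1)
```

For example, `SampleVector([3.0, 0.0, 4.0]).sample()` returns index 0 with probability 9/25 and index 2 with probability 16/25.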

Proposition 1 (Matrix Sample Model)

Given a matrix $A\in\mathbb{R}^{m\times n}$, let $\tilde{A}$ and $\tilde{A}^{\prime}$ be the vectors whose entries are $\|A_{(i)}\|$ and $\|A^{(j)}\|$, respectively. There is a data structure storing $A$ in $O(mn)$ space and supporting the following operations:

• Querying and updating an entry in $O(\log m+\log n)$ time;

• Sampling from $A_{(i)}$ for any $i\in[m]$ in time $O(\log n)$;

• Sampling from $A^{(j)}$ for any $j\in[n]$ in time $O(\log m)$;

• Finding $\|A\|_{F}$, $\|A_{(i)}\|$ and $\|A^{(j)}\|$ in time $O(1)$;

• Sampling from $\tilde{A}$ and $\tilde{A}^{\prime}$ in time $O(\log m)$ and $O(\log n)$, respectively.

This data structure can be easily implemented via Lemma 1: we use two arrays of BSTs to store all rows and columns of $A$, and two extra BSTs to store $\tilde{A}$ and $\tilde{A}^{\prime}$.

2.3 Low-rank Approximations in Sample Model

The FKV algorithm [FKV04] is a Monte-Carlo algorithm that returns approximate singular vectors of a given matrix $A$ in the matrix sample model. A low-rank approximation of $A$ can be reconstructed from the approximate singular vectors. The query and sample complexity of the FKV algorithm are independent of the size of $A$. The FKV algorithm outputs a short ‘description’ of $\hat{V}$, which approximates the right singular vectors $V$ of $A$. Similarly, it can output a description of approximate left singular vectors $\hat{U}$ of $A$ when given $A^{T}$ as input. Let FKV($A,k,\epsilon,\delta$) denote the FKV algorithm, where $A$ is a matrix given in the sample model, $k$ is the rank of the approximation of $A$, $\epsilon$ is the error parameter, and $\delta$ is the failure probability. Its guarantee is stated in Theorem 2.1.

Theorem 2.1 (Low-rank Approximations, [FKV04])

Given a matrix $A\in\mathbb{R}^{m\times n}$ in the matrix sample model, $k\in\mathbb{N}$ and $\epsilon,\delta\in(0,1)$, the FKV algorithm uses $O(poly(k,1/\epsilon,\log\frac{1}{\delta}))$ samples and queries of $A$ and, with probability $1-\delta$, outputs a description of approximate right singular vectors $\hat{V}\in\mathbb{R}^{n\times k}$ satisfying

$$\|A\hat{V}\hat{V}^{T}-A\|_{F}^{2}\leq\min_{D:\,\mathrm{rank}(D)\leq k}\|A-D\|_{F}^{2}+\epsilon\|A\|_{F}^{2}.$$

In particular, if $A$ has rank exactly $k$, the minimum on the right-hand side is zero, and Theorem 2.1 implies

$$\|A\hat{V}\hat{V}^{T}-A\|_{F}\leq\sqrt{\epsilon}\,\|A\|_{F}.$$

Description of $\hat{{V}}$ .

Note that the FKV algorithm does not output the approximate right singular vectors $\hat{V}$ directly, since their length is linear in $n$. Instead, it returns a description of $\hat{V}$ consisting of three components: a row index set $T:=\{i_{t}\in[m]\mid t\in[p]\}$, a vector set $U:=\{u^{(j)}\in\mathbb{R}^{p}\mid j\in[k]\}$ of singular vectors of a submatrix sampled from $A$, and the corresponding singular values $\Sigma:=\{\sigma^{(j)}\mid j\in[k]\}$, where $p=O(poly(k,\frac{1}{\epsilon}))$. In fact, $\hat{V}^{(i)}:=A_{T}^{T}u^{(i)}/\sigma^{(i)}$ for $i\in[k]$. Given a description of $\hat{V}$, we can sample from $\hat{V}^{(i)}$ in time $O(poly(k,\frac{1}{\epsilon}))$ for $i\in[k]$ [Tan18] and query any of its entries in time $O(poly(k,\frac{1}{\epsilon}))$.

Definition 2 ( $\alpha$ -orthonormal)

Given $\alpha>0$, $\hat{V}\in\mathbb{R}^{n\times k}$ is called $\alpha$-approximately orthonormal if $1-\alpha/k\leq\|\hat{V}^{(i)}\|^{2}\leq 1+\alpha/k$ for $i\in[k]$ and $|(\hat{V}^{(s)})^{T}\hat{V}^{(t)}|\leq\alpha/k$ for $s\neq t\in[k]$.
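Definition 2 can be checked directly on a dense matrix. The following sketch assumes NumPy and that $\hat{V}$ is small enough to hold in memory (which is not the case in the sample model); the function names are hypothetical.

```python
import numpy as np


def orthonormality_defect(V_hat):
    """Smallest alpha for which the columns of V_hat are alpha-approximately orthonormal."""
    k = V_hat.shape[1]
    gram = V_hat.T @ V_hat
    # Deviation of the squared column norms from 1.
    norm_dev = np.max(np.abs(np.diag(gram) - 1.0))
    # Largest inner product between two distinct columns.
    off_diag = gram - np.diag(np.diag(gram))
    inner = np.max(np.abs(off_diag))
    return float(k * max(norm_dev, inner))


def is_alpha_orthonormal(V_hat, alpha):
    """True if both conditions of Definition 2 hold with the given alpha."""
    return orthonormality_defect(V_hat) <= alpha
```

Both conditions in the definition bound a deviation by $\alpha/k$, so the smallest admissible $\alpha$ is $k$ times the largest deviation.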

The next lemma presents some properties of $\alpha$ -approximate orthonormal vectors.

Lemma 2 (Properties of $\alpha$ -orthonormal Vectors, [Tan18])

Given a set of $k$ $\alpha$-approximately orthonormal vectors $\hat{V}\in\mathbb{R}^{n\times k}$, there exists a set of $k$ orthonormal vectors $V\in\mathbb{R}^{n\times k}$ spanning the columns of $\hat{V}$ such that

$$\|V-\hat{V}\|_{F}\leq\alpha/\sqrt{2}+c_{1}\alpha^{2},\tag{1}$$

$$\|\Pi_{\hat{V}}-\hat{V}\hat{V}^{T}\|_{F}\leq c_{2}\alpha,\tag{2}$$

where $\Pi_{\hat{V}}:=VV^{T}$ is the orthogonal projector onto the image of $\hat{V}$ and $c_{1},c_{2}>0$ are constants.

Lemma 3 ([FKV04])

The output vectors $\hat{V}\in\mathbb{R}^{n\times k}$ of $\textsc{FKV}(A,k,\epsilon,\delta)$ are $\epsilon k/16$-approximately orthonormal.

3 Fast Anchors Seeking Algorithm

In this section, we present a randomized algorithm for SNMF called the Fast Anchors Seeking (FAS) algorithm. The input $A\in\mathbb{R}_{\geq 0}^{m\times n}$ of FAS is given in the matrix sample model, realized via the data structure described in Section 2. FAS returns the indices of the anchors in time polylogarithmic in the size of the matrix.

3.1 Description of Algorithm

Recall that SNMF aims to factorize $A=FA_{R}$, where $R$ is the index set of the anchors. In this paper, an additional constraint is added: the entries in every row of $F$ sum to 1. Namely, every data point of $A$ resides in the convex hull of $A_{R}$, i.e., the set of all convex combinations of the rows of $A_{R}$. Normalizing each row of $A$ by its $\ell_{1}$-norm is valid, since the anchors remain unchanged. Moreover, instead of storing the $\ell_{1}$-normalized matrix $A$, we can just maintain the $\ell_{1}$-norms of all rows and columns.

The Quantum Divide-and-Conquer Anchoring (QDCA) algorithm is a quantum algorithm for SNMF that achieves an exponential speedup over classical algorithms [DLL+18]. After projecting a convex hull onto a 1-dimensional space, the geometric information is partially preserved. In particular, the anchors in the 1-dimensional projected subspace are still anchors in the original space. The main idea of QDCA is to quantize the random projection step in DCA. It decomposes SNMF into several subproblems: projecting $A$ onto a set of random unit vectors $\{\beta_{i}\in\mathbb{R}^{n}\}_{i=1}^{s}$ with $s=O(k\log k)$, i.e., computing $A\beta_{i}\in\mathbb{R}^{m}$. Such a matrix-vector product can be implemented efficiently by Quantum Principal Component Analysis (QPCA), which returns a $\log m$-qubit quantum state whose amplitudes are proportional to the entries of $A\beta_{i}$. Measuring this state yields an index $j\in[m]$ distributed according to $\mathcal{D}_{A\beta_{i}}$. Thus, we can prepare $O(poly\log m)$ copies of the quantum state, measure each of them in the computational basis and record the most frequent index. By repeating this procedure $s=O(k\log k)$ times, we can identify all anchors with high probability.

As discussed above, the core and most costly procedure is to simulate $\mathcal{D}_{A\beta_{i}}$. At first sight, classical algorithms cannot achieve an exponential speedup because of the limits of the computational model. In QDCA, vectors are encoded into quantum states, and measurements sample the entries with probability proportional to their squared magnitudes. This quantum state preparation overcomes the bottleneck of the traditional computational model. Based on the divide-and-conquer scheme and the sample model (see Section 2.2), we present the Fast Anchors Seeking (FAS) algorithm, inspired by QDCA. Although FAS and QDCA follow the same scheme, designing FAS is non-trivial. We could simulate $\mathcal{D}_{A\beta_{i}}$ directly by rejection sampling, but the number of iterations of rejection sampling would be unbounded. To overcome this difficulty, we replace the matrix $A$ by its approximation $\hat{U}\hat{U}^{T}A$, where the columns of $\hat{U}\in\mathbb{R}^{m\times k}$ are $k$ approximate left singular vectors of $A$ and $k=rank(A)$. Then $y=\hat{U}^{T}A\beta_{i}\in\mathbb{R}^{k}$ is a short vector whose entries can be estimated efficiently one by one (see Lemma 7). The problem thus reduces to simulating $\mathcal{D}_{\hat{U}y}$, which can be done by Lemma 6.

Given an error parameter $\epsilon/2$, the method described above results in $\|A\beta_{i}-\hat{U}\hat{U}^{T}A\beta_{i}\|<\epsilon\|A\|_{F}\|\beta_{i}\|/2$ via Theorem 2.1, which implies $\|\mathcal{D}_{A\beta_{i}},\mathcal{D}_{\hat{U}\hat{U}^{T}A\beta_{i}}\|_{TV}\leq\epsilon\|A\|_{F}\|\beta_{i}\|/\|A\beta_{i}\|$. That is, the error $\epsilon\|A\|_{F}\|\beta_{i}\|/\|A\beta_{i}\|$ is unbounded if $\beta_{i}$ is an arbitrary vector in the entire space $\mathbb{R}^{n}$. Fortunately, this issue can be resolved by generating the random vectors $\{\beta_{i}\}_{i=1}^{s}$ in the row space of $A$ instead of the entire space $\mathbb{R}^{n}$. To generate uniform random unit vectors in the row space of $A$, we need a basis of the row space. If the columns of $V\in\mathbb{R}^{n\times k}$ form an orthonormal basis of the row space of $A$ (the space spanned by the right singular vectors), and $x_{i}$ is a uniform random unit vector on $\mathbb{S}^{k-1}$, then $\beta_{i}=Vx_{i}$ is a random unit vector in the row space of $A$. Moreover, the FKV algorithm produces approximate singular vectors $\hat{V}$ for $V$, which gives an approximation $\hat{\beta}_{i}=\hat{V}x_{i}$ of $\beta_{i}$. Therefore, we estimate the distribution $\mathcal{D}_{\hat{U}\hat{U}^{T}A\hat{V}x_{i}}$ instead of $\mathcal{D}_{A\beta_{i}}$. By Lemma 7, $\hat{U}^{T}A\hat{V}$ can be estimated efficiently, and we treat the resulting vector $\tilde{y}$ as an estimate of $\hat{U}^{T}A\hat{V}x_{i}$. By Lemma 6, $\hat{U}\tilde{y}$ can then be sampled efficiently (see Figure 2).

Once we can simulate the distribution $\mathcal{D}_{AVx_{i}}$, we can find the index of the largest component of the vector $AVx_{i}$ by drawing $O(poly\log m)$ samples (Theorem 3.2). Moreover, according to [ZBT13], repeating this procedure $O(k\log k)$ times finds all anchors of $A$ with high probability (for a single step of random projection, see Figure 3).
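The random projection idea behind this section can be illustrated with exact dense linear algebra. The sketch below is not FAS itself: it assumes NumPy, replaces FKV by an exact SVD, and computes $AVx_{i}$ explicitly instead of sampling from $\mathcal{D}_{AVx_{i}}$, so it runs in time linear in the matrix size. The function name and parameters are hypothetical.

```python
import numpy as np


def find_anchors_by_projection(A, k, num_projections, rng):
    """Collect the rows that maximise |A_norm @ beta| over random directions beta."""
    # l1-normalise the rows; entries are non-negative, so the row sum is the l1-norm.
    A_norm = A / A.sum(axis=1, keepdims=True)
    # Orthonormal basis V of the row space via an exact SVD (stand-in for FKV).
    _, _, Vt = np.linalg.svd(A_norm, full_matrices=False)
    V = Vt[:k].T
    anchors = set()
    for _ in range(num_projections):
        x = rng.standard_normal(k)
        x /= np.linalg.norm(x)            # uniform random unit vector on S^{k-1}
        projected = A_norm @ (V @ x)      # 1-dimensional projection of all data points
        # The entry of largest magnitude is the most likely sample from D_{A V x}.
        anchors.add(int(np.argmax(np.abs(projected))))
    return anchors
```

Because every normalized row is a convex combination of the normalized anchors, the projected value of largest magnitude is attained at an anchor, which is why each projection contributes an anchor index.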

3.2 Analysis

Now we state our main theorem and analyze the correctness and complexity of FAS.

Theorem 3.1 (Main Result)

Given separable non-negative matrix ${A}\in\mathbb{R}_{\geq 0}^{m\times n}$ in matrix sample model, the rank $k$ , condition number $\kappa$ and a constant $\delta\in(0,1)$ , Algorithm 1 returns the indices of anchors with probability at least $1-\delta$ in time

$$O\left(\mathrm{poly}\left(k,\kappa,\log\frac{1}{\delta},\log(mn)\right)\right).$$

3.2.1 Correctness

In this subsection, we analyze the correctness of Algorithm 1. First, we show that the columns of $V$ defined in Lemma 2 form a basis of the row space of $A$, which is necessary for generating unit vectors in the row space of $A$. Next, we prove that for each $i\in[s]$, the distribution $\mathcal{O}_{i}$ is $\epsilon$-close to $\mathcal{D}_{AVx_{i}}$ in total variation distance. Then we show how to obtain the index of the largest component of $AVx_{i}$ from $\mathcal{O}_{i}$. Finally, $O(k\log k)$ random projections suffice to obtain all anchors of $A$.

The following lemma states that the approximate singular vectors output by FKV span the row space of $A$. Combined with Lemma 2, it shows that $V$ spans the same space, i.e., the columns of $V$ form an orthonormal basis of the row space of $A$.

Lemma 4

Let $\hat{{V}}$ be the output of algorithm $\textsc{FKV }({A},k,\epsilon,\delta)$ . If $\epsilon<\frac{1}{k\kappa^{2}}$ , then with probability $1-\delta$ , we obtain

$$\mathrm{span}\{\hat{V}^{(i)}\mid i\in[l],\,l\leq k\}=\mathrm{span}\{A_{(i)}\mid i\in[m]\}.$$

Proof

By contradiction, assume that $\mathrm{span}\{\hat{V}^{(i)}\mid i\in[k]\}\neq\mathrm{span}\{A_{(i)}\mid i\in[m]\}$, which implies that there exists a unit vector $x\in\mathrm{span}\{A_{(i)}\mid i\in[m]\}$ with $x\perp\mathrm{span}\{\hat{V}^{(i)}\mid i\in[k]\}$. Then $\|Ax-A\hat{V}\hat{V}^{T}x\|=\|Ax\|\geq\sigma_{\min}(A)$, since $\hat{V}^{T}x=\vec{0}$. On the other hand, by Theorem 2.1,

$$\|Ax-A\hat{V}\hat{V}^{T}x\|\leq\sqrt{\epsilon}\,\|A\|_{F}\leq\sqrt{\epsilon k}\,\kappa\,\sigma_{\min}(A).$$

Thus $\sigma_{\min}(A)\leq\sqrt{\epsilon k}\kappa\sigma_{\min}(A)$, which is a contradiction if $\epsilon<1/(k\kappa^{2})$.

By Lemma 4, we can generate an approximate random vector in the row space of $A$ with probability $1-\delta$ in time $O(poly(k,1/\epsilon,\log 1/\delta))$ using FKV($A,k,\epsilon,\delta$). First, we obtain the description of the approximate right singular vectors from the FKV algorithm, where the error parameter $\epsilon$ is bounded in terms of the rank $k$ and the condition number $\kappa$ (see Lemma 4). Second, we generate a random unit vector $x_{i}\in\mathbb{R}^{k}$, interpreted as a coordinate vector with respect to the orthonormal vectors from Lemma 2. Let $V$ denote the matrix defined in Lemma 2; it is obvious that its columns form right singular vectors of $A$. That is, $\hat{\beta}_{i}=\hat{V}x_{i}$ approximates the random vector $\beta_{i}=Vx_{i}$. Next, we show that the total variation distance between $\mathcal{O}_{i}$ and $\mathcal{D}_{AVx_{i}}$ is bounded by the constant $\epsilon$. For convenience, we assume that each step in Algorithm 1 succeeds; the final success probability is given in the next subsection.

Lemma 5

For all $i\in[s]$ , $\mathinner{\!\left\lVert\mathcal{O}_{i},{\mathcal{D}}_{{A}{V}{x}_{i}}\right\rVert}_{TV}\leq\epsilon$ holds simultaneously with probability $1-\delta$ .

In the rest of the proof, where no ambiguity arises, we write $\mathcal{O}$ and $x$ instead of $\mathcal{O}_{i}$ and $x_{i}$. By the triangle inequality, we split the left-hand side of the inequality into four parts (see Figure 2 for the intuition):

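$$\|\mathcal{D}_{AVx},\mathcal{O}\|_{TV}\leq\underbrace{\|\mathcal{D}_{AVx},\mathcal{D}_{A\hat{V}x}\|_{TV}}_{\text{①}}+\underbrace{\|\mathcal{D}_{A\hat{V}x},\mathcal{D}_{\hat{U}\hat{U}^{T}A\hat{V}x}\|_{TV}}_{\text{②}}+\underbrace{\|\mathcal{D}_{\hat{U}\hat{U}^{T}A\hat{V}x},\mathcal{D}_{\hat{U}\tilde{M}x}\|_{TV}}_{\text{③}}+\underbrace{\|\mathcal{D}_{\hat{U}\tilde{M}x},\mathcal{O}\|_{TV}}_{\text{④}}.$$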

Thus, we only need to prove that ①, ②, ③, ④ $<\frac{\epsilon}{4}$ , respectively. In addition, given $u,v\in\mathbb{R}^{n}$ , if $\mathinner{\!\left\lVert u-v\right\rVert}\leq\frac{\epsilon}{2}\mathinner{\!\left\lVert u\right\rVert}$ , then $\mathinner{\!\left\lVert{\mathcal{D}}_{u},{\mathcal{D}}_{v}\right\rVert}_{TV}\leq\epsilon$ . For ①, ② and ③, we only show their $\ell_{2}$ -norm version, i.e.,

• $\|AVx-A\hat{V}x\|\leq\frac{\epsilon}{8}\|AVx\|$;

• $\|A\hat{V}x-\hat{U}\hat{U}^{T}A\hat{V}x\|\leq\frac{\epsilon}{8}\|A\hat{V}x\|$;

• $\|\hat{U}\hat{U}^{T}A\hat{V}x-\hat{U}\tilde{M}x\|\leq\frac{\epsilon}{8}\|\hat{U}\hat{U}^{T}A\hat{V}x\|$.
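The implication used before this list (if $\|u-v\|\leq\frac{\epsilon}{2}\|u\|$, then $\|\mathcal{D}_{u},\mathcal{D}_{v}\|_{TV}\leq\epsilon$) can be checked numerically. The sketch below assumes NumPy; the helper names are hypothetical.

```python
import numpy as np


def l2_distribution(v):
    """Density of D_v: index i has probability v_i^2 / ||v||^2."""
    return v**2 / np.dot(v, v)


def tv_distance(p, q):
    """Total variation distance between two discrete densities."""
    return 0.5 * np.sum(np.abs(p - q))


def tv_bound_holds(u, v, eps):
    """Check the conclusion ||D_u, D_v||_TV <= eps for a given pair of vectors."""
    return tv_distance(l2_distribution(u), l2_distribution(v)) <= eps
```

In practice the observed distance is often well below $\epsilon$, since the bound is a worst-case statement.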

For convenience, in the rest part, let $\alpha_{U}=\epsilon_{U}k/16$ and $\alpha_{V}=\epsilon_{V}k/16$ represent approximate ratio for orthonormality of $\hat{{U}}$ and $\hat{{V}}$ based on Lemma 3, respectively.

Before we start our proof, we list two tools which are used to prove ③ and ④, respectively.

Based on rejection sampling, Lemma 6 shows that sampling from a linear combination of $\alpha$-approximately orthonormal vectors can be done quickly without knowledge of the norms of these vectors (see Algorithm 2).

Lemma 6 ([Tan18])

Given a set of $\alpha$ -approximately orthonormal vectors $\hat{{V}}\in\mathbb{R}^{n\times k}$ in vector sample model, and an input vector $w\in\mathbb{R}^{k}$ , there exists an algorithm outputting a sample from a distribution $\frac{\alpha}{1-\alpha}$ -close to ${\mathcal{D}}_{\hat{{V}}w}$ with probability $1-\gamma$ using $O(k^{2}\log\frac{1}{\gamma}(1+O(\alpha)))$ queries and samples.
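A rejection-sampling scheme of the kind Lemma 6 relies on can be sketched as follows. This is an illustration under stated assumptions, not the paper's Algorithm 2: it assumes NumPy, keeps $\hat{V}$ as a dense array, and uses the column norms explicitly (which the vector sample model provides in $O(1)$ time). The function name and the `max_tries` cap are hypothetical.

```python
import numpy as np


def sample_linear_combination(V_hat, w, rng, max_tries=10_000):
    """Draw one index distributed as D_{V_hat w} by rejection sampling."""
    n, k = V_hat.shape
    # Weight of column i in the proposal: ||w_i V^(i)||^2.
    col_weights = (w * np.linalg.norm(V_hat, axis=0)) ** 2
    col_probs = col_weights / col_weights.sum()
    for _ in range(max_tries):
        i = rng.choice(k, p=col_probs)                    # pick a column
        col_sq = V_hat[:, i] ** 2
        s = rng.choice(n, p=col_sq / col_sq.sum())        # sample s from D_{V^(i)}
        terms = V_hat[s, :] * w                           # the k summands of (V_hat w)_s
        # Acceptance probability; at most 1 by the Cauchy-Schwarz inequality.
        accept = terms.sum() ** 2 / (k * np.dot(terms, terms))
        if rng.random() < accept:
            return int(s)
    raise RuntimeError("rejection sampling did not accept within max_tries")
```

The proposal picks index $s$ with probability proportional to $\sum_{i}(w_{i}\hat{V}_{si})^{2}$, and multiplying by the acceptance probability leaves a probability proportional to $(\hat{V}w)_{s}^{2}$. For nearly orthonormal columns the acceptance rate is about $1/k$, which is where the polynomial dependence on $k$ comes from.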

Lemma 7

Given ${A}\in\mathbb{R}^{m\times n}$ in matrix sample model and $L\in\mathbb{R}^{k_{1}\times m}$ and $R\in\mathbb{R}^{n\times k_{2}}$ in query model, let ${M}=L{A}R$ , then we can output a matrix $\tilde{{M}}\in\mathbb{R}^{k_{1}\times k_{2}}$ , with probability $1-\eta$ , such that

$$\|M-\tilde{M}\|_{F}\leq\zeta\|A\|_{F}\|L\|_{F}\|R\|_{F}$$

by $O\left(k_{1}k_{2}\frac{1}{\zeta^{2}}\log\frac{1}{\eta}\right)$ queries and samples.

Proof

Let $M_{ij}=L_{(i)}AR^{(j)}$ with $i\in[k_{1}]$ and $j\in[k_{2}]$. [Tan18] gives an algorithm that outputs an estimate $\tilde{M}_{ij}$ of $M_{ij}$ to precision $\zeta\|A\|_{F}\|L_{(i)}\|\|R^{(j)}\|$ with probability $1-\eta^{\prime}$ in time $O\left(\frac{1}{\zeta^{2}}\log\frac{1}{\eta^{\prime}}\right)$. Let $\eta^{\prime}=1-(1-\eta)^{1/(k_{1}k_{2})}$, so that all $k_{1}k_{2}$ estimates succeed simultaneously with probability $(1-\eta^{\prime})^{k_{1}k_{2}}=1-\eta$. We can thus output $\tilde{M}$ with probability $1-\eta$ using $O\left(k_{1}k_{2}\frac{1}{\zeta^{2}}\log\frac{1}{1-(1-\eta)^{1/(k_{1}k_{2})}}\right)=O\left(k_{1}k_{2}\frac{1}{\zeta^{2}}\log\frac{1}{\eta}\right)$ queries and samples, where $\tilde{M}$ satisfies

$$\|M-\tilde{M}\|_{F}^{2}\leq\sum_{i\in[k_{1}],\,j\in[k_{2}]}\zeta^{2}\|A\|_{F}^{2}\|L_{(i)}\|^{2}\|R^{(j)}\|^{2}=\zeta^{2}\|A\|_{F}^{2}\sum_{i\in[k_{1}]}\|L_{(i)}\|^{2}\sum_{j\in[k_{2}]}\|R^{(j)}\|^{2}=\zeta^{2}\|A\|_{F}^{2}\|L\|_{F}^{2}\|R\|_{F}^{2}.$$
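The entry-wise estimate used in this proof can be sketched with a median-of-means estimator. This is a minimal illustration assuming NumPy and a dense $A$ (so sampling an entry with probability $A_{ab}^{2}/\|A\|_{F}^{2}$ is done directly rather than through the sample data structure); the function name and the group parameters are hypothetical.

```python
import numpy as np


def estimate_bilinear(A, l_row, r_col, num_groups, group_size, rng):
    """Median-of-means estimate of l_row @ A @ r_col from l2-norm samples of A's entries."""
    m, n = A.shape
    fro_sq = np.sum(A**2)
    probs = (A**2 / fro_sq).ravel()
    # Sample entries (a, b) with probability A_ab^2 / ||A||_F^2.
    flat = rng.choice(m * n, size=num_groups * group_size, p=probs)
    a, b = np.unravel_index(flat, (m, n))
    # Unbiased estimator: its expectation is sum_ab L_a A_ab R_b, and its
    # second moment is at most ||A||_F^2 ||L||^2 ||R||^2.
    terms = fro_sq * l_row[a] * r_col[b] / A[a, b]
    means = terms.reshape(num_groups, group_size).mean(axis=1)
    return float(np.median(means))
```

Taking `group_size` proportional to $1/\zeta^{2}$ and `num_groups` proportional to $\log\frac{1}{\eta^{\prime}}$ matches the shape of the sample bound quoted in the proof.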

Proof (proof of Lemma 5)

Upper bound for ①. By Lemma 4, ${V}{x}$ is a unit random vector sampled from the row space of ${A}$ with probability $1-\delta_{{V}}\mathrel{\mathop{\mathchar 58\relax}}=(1-\delta)^{\frac{1}{4}}$ if $\epsilon_{{V}}<\frac{1}{k\kappa^{2}}$ . From Eq. (1) in Lemma 2, with probability $1-\delta_{{V}}$

[TABLE]

Combining this with $\mathinner{\!\left\lVert{A}\right\rVert}_{F}\leq\sqrt{k}\kappa\sigma_{\min}({A})\leq\sqrt{k}\kappa\mathinner{\!\left\lVert{A}{V}{x}\right\rVert}$ , we obtain

[TABLE]

By Eq. (3), $\mathinner{\!\left\lVert{A}{V}{x}-{A}\hat{{V}}{x}\right\rVert}\leq\frac{\epsilon}{8}\mathinner{\!\left\lVert{A}{V}{x}\right\rVert}$ holds when $\epsilon_{{V}}=O\left(\min\left\{\frac{\epsilon}{\sqrt{k}\kappa},\frac{1}{k\kappa^{2}}\right\}\right)$ .

Upper bound for ②. According to Lemma 4, if $\epsilon_{{U}}\leq\frac{1}{k\kappa^{2}}$ , then with probability $1-\delta_{{U}}:=(1-\delta)^{\frac{1}{4}}$ the columns of $\hat{{U}}$ span the column space of ${A}$ . Let $\Pi_{\hat{{U}}}$ denote the orthogonal projector onto the image of $\hat{{U}}$ (the column space of ${A}$ ). Similarly, $\Pi_{\hat{{U}}}^{\perp}$ denotes the orthogonal projector onto the orthogonal complement of the column space of $\hat{{U}}$ .

[TABLE]

based on Eq. (2) in Lemma 2. If $\epsilon_{{U}}=O\left(\min\left\{\frac{\epsilon}{k},\frac{1}{k\kappa^{2}}\right\}\right)$ , then $\mathinner{\!\left\lVert{A}\hat{{V}}{x}-\hat{{U}}\hat{{U}}^{T}{A}\hat{{V}}{x}\right\rVert}\leq\frac{\epsilon}{8}\mathinner{\!\left\lVert{A}\hat{{V}}{x}\right\rVert}$ .

Upper bound for ③. With $\epsilon_{{U}}$ and $\epsilon_{{V}}$ chosen as above, with probability $1-\eta:=(1-\delta)^{\frac{1}{4}}$ , we have

[TABLE]

According to Lemma 7, we obtain

[TABLE]

Combining Eq. (5) and Eq. (6), the following holds

[TABLE]

If $\zeta=O\left(\frac{\epsilon}{k^{2}\kappa}\right)$ , then $\mathinner{\!\left\lVert\hat{{U}}\hat{{U}}^{T}{A}\hat{{V}}{x}-\hat{{U}}\tilde{{M}}{x}\right\rVert}<\frac{\epsilon}{8}\mathinner{\!\left\lVert\hat{{U}}\hat{{U}}^{T}{A}\hat{{V}}{x}\right\rVert}$ holds.

Upper bound for ④. Since $\epsilon_{{U}}=O\left(\min\left\{\frac{\epsilon}{k},\frac{1}{k\kappa^{2}}\right\}\right)$ as discussed above, applying Lemma 6 directly gives, with probability $1-\gamma:=(1-\delta)^{\frac{1}{4s}}$ ,

[TABLE]

Hence, with probability $1-\delta$ , Algorithm 1 generates distributions $\mathcal{O}_{i}$ satisfying $\mathinner{\!\left\lVert\mathcal{O}_{i},{\mathcal{D}}_{{A}{V}{x}_{i}}\right\rVert}_{TV}\leq\epsilon$ simultaneously for all $s$ random unit vectors ${x}_{i}$ .

The following theorem tells us how to find the largest component of ${A}{V}{x}_{i}$ from the distribution $\mathcal{O}_{i}$ .

Theorem 3.2 (Restatement of Theorem 1 in [DLL+18])

Let ${\mathcal{D}}$ be a distribution over $[m]$ and let ${\mathcal{D}}^{\prime}$ be another distribution simulating ${\mathcal{D}}$ with total variation error $\epsilon$ . Let ${x}_{1},\ldots,{x}_{N}$ be examples independently sampled from ${\mathcal{D}}^{\prime}$ and let $N_{i}$ be the number of examples taking the value $i$ . Let ${\mathcal{D}}_{\max}=\max\{{\mathcal{D}}_{1},\ldots,{\mathcal{D}}_{m}\}$ and ${\mathcal{D}}_{secmax}=\max\left(\{{\mathcal{D}}_{1},\ldots,{\mathcal{D}}_{m}\}\setminus\{{\mathcal{D}}_{\max}\}\right)$ . For any $\delta>0$ , if ${\mathcal{D}}_{\max}-{\mathcal{D}}_{secmax}>2\sqrt{2\log(4N/\delta)/N}+\epsilon$ , then with probability at least $1-\delta$ we have

[TABLE]

As mentioned in [DLL+18], the assumption about the gap between ${\mathcal{D}}_{\max}$ and ${\mathcal{D}}_{secmax}$ is easy to satisfy in practice. Choosing $N=\log^{2}m$ and $\epsilon<2\sqrt{2\log(4\log^{2}m/\delta)/\log^{2}m}$ , it suffices that ${\mathcal{D}}_{\max}-{\mathcal{D}}_{secmax}>4\sqrt{2\log(4\log^{2}m/\delta)/\log^{2}m}$ , a threshold that converges to zero as $m$ goes to infinity.
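The gap condition of Theorem 3.2 and the resulting selection rule can be sketched in a few lines. The sketch reads the theorem as saying that the most frequently observed value identifies the largest entry of ${\mathcal{D}}$ , which is an assumption about its displayed conclusion; the function names `gap_threshold` and `most_frequent_index` are hypothetical.

```python
import math
import random
from collections import Counter

def gap_threshold(N, delta, eps):
    """Right-hand side of the gap condition: 2*sqrt(2*log(4N/delta)/N) + eps."""
    return 2 * math.sqrt(2 * math.log(4 * N / delta) / N) + eps

def most_frequent_index(samples):
    """Return the value i with the largest count N_i among the samples."""
    return Counter(samples).most_common(1)[0][0]

# Toy distribution over {0, 1, 2}; its gap 0.6 - 0.3 exceeds the threshold.
rng = random.Random(2)
samples = rng.choices([0, 1, 2], weights=[0.1, 0.6, 0.3], k=10_000)
print(gap_threshold(10_000, 0.1, 0.0), most_frequent_index(samples))
```

The threshold shrinks roughly like $\sqrt{\log N/N}$ , so more samples allow a smaller gap between the two largest probabilities to be resolved.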

To estimate the number of random projections needed, let $p^{*}_{i}$ denote the probability that, after a random projection $\beta$ , the data point $A_{(i)}$ is identified as an anchor in the subspace, i.e.,

[TABLE]

[ZBG14] shows that if $p_{i}^{*}>k/\alpha$ for a constant $\alpha$ , then with $s=\frac{3}{\alpha}k\log k$ random projections all anchors can be found with probability at least $1-k\exp(-\alpha s/3k)$ .

3.2.2 Complexity and Success Probability

Note that Algorithm 1 involves operations that query and sample from the matrices $A$ , $\hat{U}$ and $\hat{V}$ ; each such operation can be implemented in $O(\log(mn)\,\mathrm{poly}(k,\kappa,1/\epsilon))$ time. In the following analysis we therefore count only the number of these operations and multiply their cost into the final time complexity.

The running time and failure probability are dominated by lines 4, 6, 7 and 11 of Algorithm 1. By Theorem 2.1, lines 4 and 6 run in $O\left(\mathrm{poly}(k,1/\epsilon_{{V}},\log 1/\delta_{{V}})\right)$ and $O\left(\mathrm{poly}(k,\frac{1}{\epsilon_{{U}}},\log\frac{1}{\delta_{{U}}})\right)$ time, respectively. By Lemma 7, line 7 takes $O\left(k^{2}\frac{1}{\zeta^{2}}\log\frac{1}{\eta}\right)$ time to estimate the matrix $\tilde{M}$ . Line 11, over its $s$ iterations, takes $O\left(sk^{2}\log\frac{1}{\gamma}\,\mathrm{poly}\log m\right)$ time in total. As for the success probability, lines 4, 6 and 7 each succeed with probability $(1-\delta)^{\frac{1}{4}}$ , and each iteration of line 11 succeeds with probability $(1-\delta)^{\frac{1}{4s}}$ .

In summary, the time complexity of FAS is $O\left(\mathrm{poly}\left(k,\kappa,\log\frac{1}{\delta},\log mn\right)\right)$ and the success probability is $1-\delta$ .
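As a check on the stated success probability, the per-step allocations chosen in the proof of Lemma 5 can be multiplied out. This assumes the individual steps succeed independently, which is an assumption made here for illustration:

$$\underbrace{(1-\delta)^{\frac{1}{4}}}_{1-\delta_{V}}\cdot\underbrace{(1-\delta)^{\frac{1}{4}}}_{1-\delta_{U}}\cdot\underbrace{(1-\delta)^{\frac{1}{4}}}_{1-\eta}\cdot\Big(\underbrace{(1-\delta)^{\frac{1}{4s}}}_{1-\gamma}\Big)^{s}=(1-\delta)^{\frac{3}{4}+\frac{1}{4}}=1-\delta .$$

The first three factors correspond to lines 4, 6 and 7 of Algorithm 1, and the last to the $s$ applications of Lemma 6 in line 11.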

4 Conclusion

This paper presents FAS, a classical randomized algorithm that dramatically reduces the running time needed to find the anchors of a low-rank matrix. In particular, we achieve an exponential speedup when the rank is logarithmic in the input size. Although our algorithm runs in time polynomial in the logarithm of the matrix dimensions, it still has a poor dependence on the rank $k$ . In the future, we plan to improve this dependence and to analyze the algorithm's noise tolerance.

Bibliography

[AGKM12] Sanjeev Arora, Rong Ge, Ravindran Kannan, and Ankur Moitra. Computing a nonnegative matrix factorization–provably. In Proceedings of the forty-fourth annual ACM symposium on Theory of computing, pages 145–162. ACM, 2012.
[AGM12] Sanjeev Arora, Rong Ge, and Ankur Moitra. Learning topic models–going beyond svd. In Foundations of Computer Science (FOCS), 2012 IEEE 53rd Annual Symposium on, pages 1–10. IEEE, 2012.
[CLW18] Nai-Hui Chia, Han-Hsuan Lin, and Chunhao Wang. Quantum-inspired sublinear classical algorithms for solving low-rank linear systems. arXiv preprint arXiv:1811.04852, 2018.
[DLJ10] Chris HQ Ding, Tao Li, and Michael I Jordan. Convex and semi-nonnegative matrix factorizations. IEEE transactions on pattern analysis and machine intelligence, 32(1):45–55, 2010.
[DLL+18] Yuxuan Du, Tongliang Liu, Yinan Li, Runyao Duan, and Dacheng Tao. Quantum divide-and-conquer anchoring for separable non-negative matrix factorization. In Proceedings of the 27th International Joint Conference on Artificial Intelligence, pages 2093–2099. AAAI Press, 2018.
[DS04] David Donoho and Victoria Stodden. When does non-negative matrix factorization give a correct decomposition into parts? In Advances in neural information processing systems, pages 1141–1148, 2004.
[EMO+12] Ernie Esser, Michael Moller, Stanley Osher, Guillermo Sapiro, and Jack Xin. A convex model for nonnegative matrix factorization and dimensionality reduction on physical space. IEEE Transactions on Image Processing, 21(7):3239–3252, 2012.
[ESV12] Ehsan Elhamifar, Guillermo Sapiro, and Rene Vidal. See all by looking at a few: Sparse modeling for finding representative objects. In 2012 IEEE Conference on Computer Vision and Pattern Recognition, pages 1600–1607. IEEE, 2012.
