Theory of Spectral Method for Union of Subspaces-Based Random Geometry Graph

Gen Li; Yuantao Gu

arXiv:1907.10906·cs.LG·July 26, 2019


TL;DR

This paper develops a theoretical framework for spectral methods in clustering data near unions of subspaces using random geometry graphs, demonstrating broad conditions for effectiveness and supporting findings with numerical experiments.

Contribution

It provides the first comprehensive theory analyzing spectral subspace clustering via random geometry graphs, expanding understanding of its efficiency and potential applications.

Findings

01

Spectral method effectively clusters data near unions of subspaces.

02

Theoretical analysis confirms broad conditions for success.

03

Numerical experiments validate the theoretical predictions.

Abstract

The spectral method is a commonly used scheme for Subspace Clustering, i.e., clustering data points that lie close to a Union of Subspaces, which proceeds by first constructing a Random Geometry Graph. This paper establishes a theory to analyze this method. Based on this theory, we demonstrate the efficiency of Subspace Clustering under fairly broad conditions. The insights and analysis techniques developed in this paper might also have implications for other random graph problems. Numerical experiments support our theoretical study.

Equations

$${\rm aff},\qquad\kappa,\qquad\rho$$

$${\bm{y}}={\bm{x}}+{\bm{z}},$$

$$\mathrm{SNR}=10\log\frac{1}{\sigma^{2}},$$

$$\gamma=\frac{1}{4}\left\|\frac{1}{\sqrt{N}}\mathrm{sgn}({\bm{w}})-{\bm{v}}\right\|_{2}^{2}.$$

$$\left\|\frac{1}{\sqrt{N}}\mathrm{sgn}({\bm{w}})-{\bm{v}}\right\|_{2}\leq\cdots$$

$$\|{\bm{A}}{\bm{x}}-\lambda{\bm{x}}\|_{2}\geq(\lambda-\lambda_{3}({\bm{A}}))\|{\bm{x}}-{\bm{P}}_{\bm{W}}{\bm{x}}\|_{2},$$

$$\gamma=\frac{1}{4}\left\|\frac{1}{\sqrt{N}}\mathrm{sgn}({\bm{w}})-{\bm{v}}\right\|_{2}^{2}\lesssim\frac{\|{\bm{A}}{\bm{u}}-\lambda_{1}{\bm{u}}\|_{2}^{2}}{(\lambda_{1}-\lambda_{3}({\bm{A}}))^{2}}+\frac{\|{\bm{A}}{\bm{v}}-\lambda_{2}{\bm{v}}\|_{2}^{2}}{(\lambda_{2}-\lambda_{3}({\bm{A}}))^{2}},$$

$$\left\{\forall i,\ \big|\|{\bm{a}}_{i}\|-1\big|<t\right\}$$

$$\left\{\forall i,\ \left|\sum_{k}\lambda_{k}^{2}a_{ik}^{2}-\frac{\sum_{k}\lambda_{k}^{2}}{d}\right|<t\right\}$$

$$\left\{\forall i\neq j,\ |\langle{\bm{x}}_{i},{\bm{x}}_{j}\rangle|<t\right\},$$

$${\mathbb{P}}\left(\mathcal{E}\left(c_{1}\sqrt{\frac{\log N}{d}}\right)\right)>1-\mathrm{e}^{-c_{2}\log N}.$$

$$\Phi(t):=\int_{|x|>t}\frac{1}{\sqrt{2\pi}}\mathrm{e}^{-\frac{x^{2}}{2}}\,\mathrm{d}x,$$

$$p=p_{i}>\Phi(\tau_{d}(1+t))-\mathrm{e}^{-c_{2}\log N},$$

$$p_{i}:={\mathbb{P}}\left(|\langle{\bm{x}}_{i},{\bm{x}}_{j}\rangle|\geq\tau\,\Big|\,{\bm{x}}_{i}\right)={\mathbb{P}}\left(|\langle\overline{{\bm{a}}}_{i},\overline{{\bm{a}}}_{j}\rangle|\geq\tau\,\Big|\,{\bm{a}}_{i}\right),$$

$${\mathbb{P}}\left(|\langle{\bm{x}}_{i},{\bm{x}}_{j}\rangle|\geq\tau\,\Big|\,{\bm{x}}_{i}\right)=\cdots\geq\cdots>\cdots$$

$${\mathbb{P}}\left(\|{\bm{a}}_{j}\|>1+t\right)<\mathrm{e}^{-c_{2}\log N}$$

$$\Phi\left(\frac{\tau_{d}(1+t)}{\sqrt{{\rm aff}^{2}-t}}\right)-\mathrm{e}^{-c_{2}\log N}<q_{i}<\Phi\left(\frac{\tau_{d}(1-t)}{\sqrt{{\rm aff}^{2}+t}}\right)+\mathrm{e}^{-c_{2}\log N},$$

$${\bm{U}}_{2}^{\top}{\bm{U}}_{1}=\left[\begin{array}{ccc}\lambda_{1}&&\\ &\ddots&\\ &&\lambda_{d}\end{array}\right].$$

$${\mathbb{P}}\left(|\langle{\bm{x}}_{i},{\bm{x}}_{j}\rangle|\geq\tau\,\Big|\,{\bm{x}}_{i}\right)={\mathbb{P}}\left(|\langle{\bm{U}}_{1}\overline{{\bm{a}}}_{i},{\bm{U}}_{2}\overline{{\bm{a}}}_{j}\rangle|\geq\tau\,\Big|\,{\bm{a}}_{i}\right),$$

$$\left|\|{\bm{U}}_{2}^{\top}{\bm{U}}_{1}\overline{{\bm{a}}}_{i}\|^{2}-{\rm aff}^{2}\right|<t,$$

$$q_{i}=\cdots=\cdots\geq\cdots>\cdots$$

$$q_{i}<\Phi\left(\frac{\tau_{d}(1-t)}{\sqrt{{\rm aff}^{2}+t}}\right)+\mathrm{e}^{-c_{2}\log N}.$$

$$p-q\gtrsim\kappa,$$

$$\frac{1}{N}\sum_{i}(q_{i}-q)^{2}\lesssim\frac{\log N}{d}.$$

$${\mathbb{P}}\left(\left|\frac{1}{N/2-1}\sum_{j:{\bm{x}}_{j}\in S}A_{ij}-p\right|>t\right)<\mathrm{e}^{-\frac{t^{2}(N/2-1)}{p+\frac{1}{3}t}},$$


Videos

Theory of Spectral Method for Union of Subspaces-Based Random Geometry Graph· slideslive

Taxonomy

Topics: Face and Expression Recognition · Computational Geometry and Mesh Generation · Topological and Geometric Data Analysis

Full text


The authors are with Department of Electronic Engineering, Tsinghua University, Beijing 100084, China. The corresponding author of this paper is Y. Gu ([email protected]).

(Manuscript submitted July 23, 2019.)


Keywords: Spectral Method, Union of Subspaces, Subspace Clustering, Random Graph, Random Geometry Graph

1 Introduction

1.1 Motivation

The Union of Subspaces (UoS) model serves as an important model in statistical machine learning. Briefly speaking, UoS models high-dimensional data, encountered in many real-world problems, that lie close to low-dimensional subspaces corresponding to the classes to which the data belong, such as hand-written digits (Hastie and Simard, 1998), face images (Basri and Jacobs, 2003), DNA microarray data (Parvaresh et al., 2008), and hyper-spectral images (Chen et al., 2011), to name just a few. A fundamental task in processing data points in UoS is to cluster these data points, which is known as Subspace Clustering (SC). Applications of SC span science and engineering, including motion segmentation (Costeira and Kanade, 1998, Kanatani, 2001), face recognition (Wright et al., 2008), and classification of diseases (McWilliams and Montana, 2014). We refer the reader to the tutorial paper (Vidal, 2011) for a review of the development of SC.

Given the wide applications of SC, numerous algorithms have been developed for subspace clustering (Tipping and Bishop, 1999, Tseng, 2000, Vidal et al., 2005, Yan and Pollefeys, 2006, Elhamifar and Vidal, 2009, Peng et al., 2018, Meng et al., 2018). Arguably, the most popular and efficient methods for solving SC are a series of two-step algorithms, namely Sparse Subspace Clustering (SSC) and its variants (Elhamifar and Vidal, 2009, Liu et al., 2012, Dyer et al., 2013, Heckel and Bölcskei, 2015, Chen et al., 2017). These algorithms first construct a random graph (or, equivalently, an adjacency matrix), called the Union of Subspaces-based Random Geometry Graph (UoS-RGG), which depends on the relative positions of the data points, and then apply the spectral method (Ng et al., 2002, Von Luxburg, 2007) to obtain the clustering result.

Although these algorithms work well in practice for many applications, theoretical guarantees on the clustering accuracy of any SC algorithm are lacking. While novel and often efficient subspace clustering techniques emerge all the time, establishing rigorous theory for such techniques is quite difficult, and no such theory exists as of now. The fundamental difficulty in the analysis of SC algorithms may be the change of view required in treating UoS-RGG (or a general Random Geometry Graph, RGG), which has non-independent edges, in contrast with the traditional approach to analyzing clustering algorithms via the Stochastic Block Model (SBM), which assumes independent edges. Section 1.2 offers a detailed discussion of this difficulty, as well as a survey of existing theoretical attempts. We therefore pose the central question that this paper aims to explore:

• Why does SC work, or more precisely, why does the spectral method work for RGG or UoS-RGG?

This paper focuses on the analysis of the spectral method for UoS-RGG. We consider a naive and prototypical SC algorithm (Algorithm 1) and prove that this algorithm, though oversimplified, can still deliver an almost correct clustering result even when the subspaces are quite close to each other and when the number of samples is far less than the subspace dimension (see Theorem 1). To the best of our knowledge, this is the first theory established to analyze the clustering error of an SC algorithm. It not only constitutes the first theoretical guarantee for the accuracy of subspace clustering, but also provides the interesting insight that the widely conjectured oversampling requirement for subspace clustering is redundant, and that subspace clustering is quite robust in the presence of closely aligned subspaces. We also verify our results by numerical experiments in Section 4. Although our theoretical results are proved only for the simplified algorithm we choose, it is plausible that more carefully designed SC algorithms would perform even better than what we guarantee here, and our proof could serve as a prototype for the analysis of these algorithms.

1.2 Related Works and Challenges

We now briefly review the literature on the adjacency matrix and the spectral method and discuss its shortcomings. Since this paper mainly deals with theory, we shall focus on the theoretical aspects of existing results.

1.2.1 Analysis of Random Graphs for UoS

To analyze the random graphs associated with the UoS model in an abstract setting without referring to any specific algorithm, most research focuses on the Subspace Detection Property (SDP, Soltanolkotabi et al., 2012, Liu et al., 2012, Soltanolkotabi et al., 2014), a property which indicates that there are no edges between data points in different subspaces, but many edges between data points in the same subspace. Under some technical conditions on the parameters of SC, the random graphs constructed by a variety of SC algorithms have been proved to enjoy SDP. Readers may consult Section 3 in Soltanolkotabi et al. (2014) for details.

There are, however, two main deficiencies of SDP which render SDP hard to use in further analysis. The first one is that SDP does not imply a correct clustering result. Actually, one can easily construct a counter-example where SDP holds but the clustering result is unsatisfying. The second one is that SDP requires too restrictive conditions on affinity between subspaces and sampling rate to hold. These conditions are provably unnecessary, as will be demonstrated in Section 3 of this paper.

1.2.2 Analysis of Spectral Method for Random Graphs

Compared with SDP, a more concrete approach to analyzing SC algorithms is to investigate the performance of the spectral method on random graphs associated with the UoS model. To this end, the analysis of the spectral method for general random graphs (not necessarily associated with the UoS model) is relevant. Note that the spectral method is explored deeply in the literature on community detection, which is an important problem in statistics, computer vision, and image processing (Abbe, 2017). The Stochastic Block Model (SBM) is a widely used theoretical model in this field, which we briefly introduce as follows. For simplicity, we consider the two-block case, where the vertices of the random graph are divided into two “blocks”, i.e. sets of vertices that ought to be closely related, each of size $N/2$ . Each edge of the random graph is then independently generated from the following distribution: for $p>q>0$ , vertices ${\bm{x}}_{i}$ and ${\bm{x}}_{j}$ are connected with probability $p$ if ${\bm{x}}_{i},{\bm{x}}_{j}$ belong to the same block, and with probability $q$ if they belong to different blocks. Given an instance of this graph, we would like to identify the two blocks. Recently, a series of theoretical works has been devoted to analyzing the performance of the spectral method on this problem in different settings (Coja-Oghlan, 2010, Vu, 2014, Chin et al., 2015, Abbe et al., 2017), and extensions (Sankararaman and Baccelli, 2018).

As far as we know, all existing results make essential use of the independence of different edges, which is unfortunately not the case in SC algorithms. In fact, it is a generic and natural phenomenon in RGG that when ${\bm{x}}_{i},{\bm{x}}_{j}$ and ${\bm{x}}_{i},{\bm{x}}_{k}$ are connected, the probability that ${\bm{x}}_{j},{\bm{x}}_{k}$ are connected will be higher, hence the independence assumption does not hold for RGG.
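The dependence between edges described above can be observed numerically. The following is a minimal Monte Carlo sketch, not taken from the paper: it draws triples of points uniformly on a sphere, connects two points when the absolute inner product is at least a threshold, and compares the unconditional edge probability with the probability of the edge $(j,k)$ given that $(i,j)$ and $(i,k)$ are both present. The function name `triangle_dependence` and the parameter values are illustrative assumptions.

```python
import numpy as np

def triangle_dependence(d=3, tau=0.8, trials=200_000, seed=0):
    """Estimate P(A_jk = 1) and P(A_jk = 1 | A_ij = 1, A_ik = 1) for a
    thresholded inner-product graph on the unit sphere in R^d.
    All names and default values are illustrative assumptions."""
    rng = np.random.default_rng(seed)

    def sphere(m):
        # Normalized Gaussian vectors are uniform on the unit sphere.
        g = rng.standard_normal((m, d))
        return g / np.linalg.norm(g, axis=1, keepdims=True)

    xi, xj, xk = sphere(trials), sphere(trials), sphere(trials)
    e_ij = np.abs(np.sum(xi * xj, axis=1)) >= tau
    e_ik = np.abs(np.sum(xi * xk, axis=1)) >= tau
    e_jk = np.abs(np.sum(xj * xk, axis=1)) >= tau
    both = e_ij & e_ik  # triples where i is connected to both j and k
    return float(e_jk.mean()), float(e_jk[both].mean())

marginal, conditional = triangle_dependence()
print(f"P(A_jk=1) ~ {marginal:.3f}, P(A_jk=1 | A_ij=A_ik=1) ~ {conditional:.3f}")
```

If the edges were independent, as in SBM, the two estimates would agree up to sampling noise; in this geometric construction the conditional estimate is noticeably larger.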

With this fundamental gap in mind, it is crucial to develop a theory for RGG to provide a rigorous theoretical guarantee for SC algorithms.

2 Preliminaries and Problem Formulation

The generative model for data points in UoS that we adopt in this paper is the semi-random model introduced in Soltanolkotabi et al. (2012), which assumes that the subspaces are fixed with points distributed uniformly at random on each subspace. This is arguably the simplest model providing a good starting point for a theoretical investigation. We assume the data consist of two clusters, corresponding to two fixed subspaces $S_{1},S_{2}$ in $\mathbb{R}^{n}$ , each with $N/2$ data points uniformly sampled from the unit spheres $\mathcal{S}_{1}^{d-1}$ and $\mathcal{S}_{2}^{d-1}$ respectively in $S_{1}$ and $S_{2}$ . (The number of subspaces is by no means crucial to the analysis; the results in this paper can be generalized to more subspaces easily.) Here $d$ is the subspace dimension and $n$ is the ambient dimension. The goal of SC is to cluster the normalized data points $\{{\bm{x}}_{i}\}_{1\leq i\leq N}$ .

Given the general description of SC, we turn our attention to a simple prototypical SC algorithm detailed in Algorithm 1, which we call Thresholding Inner-Product Subspace Clustering (TIP-SC). Since the angle between data points in the same subspace is statistically smaller, we construct, for some threshold $\tau\in(0,1)$ , the random graph by computing its adjacency matrix ${\bm{A}}$ , where $A_{ij}=1$ if $i\neq j,|\langle{\bm{x}}_{i},{\bm{x}}_{j}\rangle|\geq\tau$ , and $A_{ij}=0$ otherwise. The TIP-SC algorithm concludes by applying the spectral clustering method to ${\bm{A}}$ .
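The steps above can be sketched in a few lines of NumPy. Algorithm 1 itself is not reproduced in this text, so the final labeling step below is an assumption made for illustration: it takes the sign of a vector in the top-two eigenspace of ${\bm{A}}$ that is orthogonal to the projection of the constant vector, following the description of ${\bm{w}}$ in the Notations paragraph. The function name `tip_sc` is hypothetical.

```python
import numpy as np

def tip_sc(X, tau):
    """Sketch of thresholding inner-product subspace clustering.
    X: (N, n) array with one data point per row; tau: threshold in (0, 1).
    Returns labels in {+1, -1}. The sign-based labeling step is an assumption."""
    N = X.shape[0]
    X = X / np.linalg.norm(X, axis=1, keepdims=True)  # normalize data points
    A = (np.abs(X @ X.T) >= tau).astype(float)        # A_ij = 1 iff |<x_i, x_j>| >= tau
    np.fill_diagonal(A, 0.0)                          # A_ii = 0
    _, vecs = np.linalg.eigh(A)                       # eigenvalues in ascending order
    W = vecs[:, -2:]                                  # basis of the top-two eigenspace
    u = np.full(N, 1.0 / np.sqrt(N))
    c = W.T @ u                                       # coordinates of P_W u in the basis W
    w = W @ np.array([-c[1], c[0]])                   # vector in W orthogonal to P_W u
    return np.where(w >= 0, 1, -1)
```

The labels are only defined up to a global sign, so any comparison with a ground truth should allow for swapping the two clusters.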

The main task of this paper is to prove that this simple algorithm achieves high clustering accuracy under fairly general conditions, which will be done in the next section.

Notations.

Let ${\bm{U}}_{1},{\bm{U}}_{2}$ denote the orthonormal bases for the subspaces $S_{1},S_{2}$ , respectively, and $\lambda_{1}\geq\ldots\geq\lambda_{d}\geq 0$ denote the singular values of ${\bm{U}}_{1}^{\top}{\bm{U}}_{2}$ . We also use $S$ and $S^{\prime}$ to denote the subspaces to which ${\bm{x}}_{i}$ does and doesn’t belong, respectively. Then ${\bm{x}}_{i}={\bm{U}}\overline{{\bm{a}}}_{i}$ where ${\bm{U}}$ denotes the orthonormal bases for $S$ , ${\bm{a}}_{i}\overset{\mathrm{ind.}}{\sim}\mathcal{N}(\bm{0},\frac{1}{d}{\bm{I}}_{d})\in\mathbb{R}^{d}$ , and $\overline{{\bm{a}}}_{i}={\bm{a}}_{i}/\|{\bm{a}}_{i}\|$ denotes its normalization. We use $p,q$ to represent the probability that $A_{ij}=1$ for $j\neq i,{\bm{x}}_{j}\in S$ and ${\bm{x}}_{j}\in S^{\prime}$ , respectively. Conditioned on ${\bm{x}}_{i}$ , let $p_{i}$ denote the probability of $A_{ij}=1$ for $j\neq i,{\bm{x}}_{j}\in S$ , and $q_{i}$ denote the probability of $A_{ij}=1$ for $j,{\bm{x}}_{j}\in S^{\prime}$ . Denote

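$${\rm aff}:=\sqrt{\frac{\sum_{k}\lambda_{k}^{2}}{d}},\qquad\kappa:=1-{\rm aff}^{2},\qquad\rho:=\frac{N}{d}.$$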

Let ${\bm{u}},{\bm{v}}\in\mathbb{R}^{N}$ with $u_{i}=\frac{1}{\sqrt{N}}$ , and $v_{i}=\frac{1}{\sqrt{N}}$ , if ${\bm{x}}_{i}\in S_{1}$ , and $v_{i}=-\frac{1}{\sqrt{N}}$ , if ${\bm{x}}_{i}\in S_{2}$ , then ${\bm{v}}$ is the ground truth. ${\bm{W}}$ denotes the eigenspace corresponding to the top two eigenvalues of ${\bm{A}}$ , and ${\bm{w}}$ denotes the vector in ${\bm{W}}$ , which is perpendicular to the projection of ${\bm{u}}$ in ${\bm{W}}$ .

3 Error Rate of TIP-SC Algorithm

This section presents our main theoretical results concerning the performance of TIP-SC. Via perturbation analysis of ${\bm{A}}$ around ${\mathbb{E}}{\bm{A}}$ , the success of the spectral method for SBM has been proved under various statistical assumptions. However, such analysis is insufficient to establish our result, since for UoS-RGG the independence condition does not hold, which is the crux of the failure of existing methods for analyzing the spectral method on random graphs. As a substitute, we exploit a conditional independence property of ${\bm{A}}$ , based on which we prove that the clustering result of TIP-SC is almost correct under mild conditions on affinity and sampling rate, as stated in the following theorem.

Theorem 1.

Choosing $\tau=O\left(\frac{1}{\sqrt{d}}\right)$ such that $p=O(1)$ , there exists a numerical constant $c>0$ such that whenever $\kappa>c\sqrt[4]{\frac{\log N}{d}}$ , the clustering error rate of TIP-SC is less than $O\left(\frac{(1+1/\rho)\log N}{\kappa^{2}d}\right)$ with probability at least $1-\mathrm{e}^{-\Omega(\log N)}$ .

Parameter selection is often critical for the success of algorithms. The above result suggests that a dense graph ( $p=O(1)$ ) is usually a good choice, which is quite different from what SDP suggests.

In this regime, the above result indicates that the algorithm works correctly under fairly broad conditions compared with existing analyses of SC. A fascinating insight revealed by the above theorem is that even when the number of samples satisfies $N\ll d$ , we can successfully cluster the data set, which shows that the commonly accepted opinion that $\rho>1$ is necessary for SC is partially inaccurate.

To clarify the condition on $\kappa$ , namely on affinity, assume the two subspaces overlap in a smaller subspace of dimension $s$ , but are orthogonal to each other in the remaining directions. In this case, the affinity between the two subspaces is equal to $\sqrt{s/d}$ . Our assumption on $\kappa$ indicates that the subspaces can intersect in almost all dimensions, i.e., $s=(1-o(1))d$ . In contrast, previous works (Soltanolkotabi et al., 2012, 2014) require that the overlapping dimension obey $s=o(1)d$ , so that the subspaces are practically orthogonal to each other.
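The overlap example can be checked numerically. The sketch below assumes the usual normalized form of affinity, ${\rm aff}=\sqrt{\sum_{k}\lambda_{k}^{2}/d}$ with $\lambda_{k}$ the singular values of ${\bm{U}}_{1}^{\top}{\bm{U}}_{2}$ , which is consistent with the value $\sqrt{s/d}$ stated above; the helper names `affinity` and `overlapping_bases` are hypothetical.

```python
import numpy as np

def affinity(U1, U2):
    """Affinity between two d-dimensional subspaces with orthonormal bases
    U1, U2 (columns), assuming aff = sqrt(sum_k lambda_k^2 / d)."""
    d = U1.shape[1]
    sv = np.linalg.svd(U1.T @ U2, compute_uv=False)  # singular values lambda_k
    return float(np.sqrt(np.sum(sv ** 2) / d))

def overlapping_bases(n, d, s):
    """Two coordinate subspaces of R^n sharing exactly s basis vectors
    and orthogonal in the remaining directions (requires 2d - s <= n)."""
    I = np.eye(n)
    return I[:, :d], I[:, d - s:2 * d - s]

U1, U2 = overlapping_bases(n=300, d=100, s=50)
print(affinity(U1, U2), np.sqrt(50 / 100))  # the two values should agree
```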

In the noisy case, we assume each data point is of the form

$${\bm{y}}={\bm{x}}+{\bm{z}},\qquad(1)$$

where ${\bm{x}}$ denotes the clean data used in the above theorem, and ${\bm{z}}\sim\mathcal{N}(0,\frac{\sigma^{2}}{n}{\bm{I}})$ is an independent stochastic noise term. We have the following robustness guarantee for TIP-SC.

Theorem 2.

Choosing $\tau=O\left(\frac{1}{\sqrt{d}}\right)$ such that $p=O(1)$ , there exist numerical constants $c,\sigma^{*}>0$ such that whenever $\kappa>c\sqrt[4]{\frac{\log N}{d}}$ and $\sigma<\sigma^{*}$ , the clustering error rate of TIP-SC is less than $O\left(\frac{(1+\sigma^{2}d/n)^{2}(1+1/\rho)\log N}{\kappa^{2}d}\right)$ with probability at least $1-\mathrm{e}^{-\Omega(\log N)}$ .

The proof is similar to that of Theorem 1, and both are deferred to Section 5.

4 Numerical Experiments

In this section, we perform numerical experiments validating our main results. We evaluate the algorithm and theoretical results based on the clustering accuracy. The impacts of $\kappa,\rho,p,q$ on the clustering accuracy are demonstrated. Besides, we also show the efficiency of TIP-SC in the presence of noise.

Following the definition of the semi-random model, and to save computation and keep things simple, the data are generated by the following steps.

1) Given $d\ll n$ and ${\rm aff}=\sqrt{s/d}$ , define ${\bm{e}}_{i}\in\mathbb{R}^{n}$ , whose entries are zero except the $i$ -th entry, which is one. Let ${\bm{U}}_{1}=[{\bm{e}}_{1},{\bm{e}}_{2},\ldots,{\bm{e}}_{d}]$ be the orthonormal basis for subspace $S_{1}$ , and ${\bm{U}}_{2}=[{\bm{e}}_{d-s+1},{\bm{e}}_{d-s+2},\ldots,{\bm{e}}_{2d-s}]$ be the orthonormal basis for subspace $S_{2}$ , such that the affinity between $S_{1}$ and $S_{2}$ is $\sqrt{s/d}$ .

2) Given $N=\rho d$ , generate $N$ vectors ${\bm{a}}_{1},{\bm{a}}_{2},\ldots,{\bm{a}}_{N}\in\mathbb{R}^{d}$ independently from $\mathcal{N}(0,\frac{1}{d}{\bm{I}})$ . Let ${\bm{x}}_{i}={\bm{U}}_{1}\frac{{\bm{a}}_{i}}{\|{\bm{a}}_{i}\|}$ for $1\leq i\leq N/2$ and ${\bm{x}}_{i}={\bm{U}}_{2}\frac{{\bm{a}}_{i}}{\|{\bm{a}}_{i}\|}$ for $N/2+1\leq i\leq N$ .

3) In the presence of noise, given $\sigma>0$ , generate $N$ random noise terms ${\bm{z}}_{1},{\bm{z}}_{2},\ldots,{\bm{z}}_{N}\in\mathbb{R}^{n}$ independently from $\mathcal{N}(0,\frac{\sigma^{2}}{n}{\bm{I}})$ . Let the normalized data of ${\bm{x}}_{i}+{\bm{z}}_{i}$ be the input of Algorithm 1.
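The three generation steps above can be sketched as follows. This is an illustrative NumPy version, not the authors' code; the function name `generate_uos_data`, its argument names, and the $\pm 1$ label convention are assumptions.

```python
import numpy as np

def generate_uos_data(n, d, s, N, sigma=0.0, seed=0):
    """Generate N unit-norm points, half from each of two d-dimensional
    coordinate subspaces of R^n that overlap in s directions.
    Returns (X, labels) with X of shape (N, n) and labels in {+1, -1}."""
    assert 2 * d - s <= n, "ambient dimension too small for the two bases"
    rng = np.random.default_rng(seed)
    I = np.eye(n)
    U1 = I[:, :d]                    # step 1: basis e_1, ..., e_d
    U2 = I[:, d - s:2 * d - s]       # step 1: basis e_{d-s+1}, ..., e_{2d-s}
    # Step 2: a_i ~ N(0, I/d), normalized, then mapped into each subspace.
    a = rng.normal(scale=1.0 / np.sqrt(d), size=(N, d))
    a_bar = a / np.linalg.norm(a, axis=1, keepdims=True)
    half = N // 2
    X = np.vstack([a_bar[:half] @ U1.T, a_bar[half:] @ U2.T])
    # Step 3: optional noise z_i ~ N(0, sigma^2 I / n), followed by normalization.
    if sigma > 0:
        X = X + rng.normal(scale=sigma / np.sqrt(n), size=(N, n))
        X = X / np.linalg.norm(X, axis=1, keepdims=True)
    labels = np.concatenate([np.ones(half), -np.ones(N - half)])
    return X, labels
```

Because the bases are coordinate vectors, multiplying by ${\bm{U}}_{1}$ or ${\bm{U}}_{2}$ only places the $d$ coefficients into the right coordinates, which is what keeps the generation cheap.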

Since there are many factors to consider, we always examine the relation between two quantities of interest while keeping the others at predefined typical values, i.e., $d^{*}=100,n^{*}=5000,\kappa^{*}=1-\sqrt{1/2}\ (s^{*}=d/2),\rho^{*}=1$ , and $\tau$ is chosen to be $\tau^{*}$ such that the connection rate $\frac{p+q}{2}=0.2$ . We conduct the experiments in the noiseless setting, except for the last one, which tests the robustness of Algorithm 1. Moreover, each curve is obtained from $100$ trials, with the mean and the standard deviation represented by the line and the error bars, respectively. We find that the variation across trials vanishes in all experiments when the error rate is small.
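The text does not say how $\tau^{*}$ is computed. One simple possibility, shown here purely as an assumption, is to pick the threshold empirically as a quantile of the off-diagonal absolute inner products, so that the fraction of connected pairs (which is close to $\frac{p+q}{2}$ for two equally sized clusters) matches the target rate. The function name `threshold_for_rate` is hypothetical.

```python
import numpy as np

def threshold_for_rate(X, rate=0.2):
    """Pick tau so that roughly a fraction `rate` of the pairs (i, j), i < j,
    satisfy |<x_i, x_j>| >= tau. Rows of X are assumed to have unit norm.
    This empirical-quantile rule is an assumption, not the paper's procedure."""
    G = np.abs(X @ X.T)
    iu = np.triu_indices(X.shape[0], k=1)  # each unordered pair once
    return float(np.quantile(G[iu], 1.0 - rate))
```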

Clearly, increasing $\tau$ to decrease $q$ simultaneously decreases $p$ , as also shown in Figure 1. Combined with the result of the second experiment (cf. Figure 2), this shows that it is better to make both $p$ and $q$ large than to choose $q=0$ , although $q=0$ is what SDP suggests. This is consistent with our result and shows that SDP is somewhat inadequate for SC.

In the third and fourth experiments, we inspect the impact of affinity and sampling rate on the performance of TIP-SC. Figure 3 and Figure 4 verify the claim that SC works well under fairly broad conditions. In addition, according to (1), we have

$$\mathrm{SNR}=10\log\frac{1}{\sigma^{2}},$$

and the last experiment (cf. Figure 5) shows that the algorithm is robust even when the SNR is low.

5 Proof of Main Results

5.1 Proof of Theorem 1

Recall the definition of ${\bm{u}},{\bm{v}},{\bm{w}},{\bm{W}}$ in Section 2, and notice that analyzing the error rate, denoted by $\gamma$ , is equivalent to studying the difference between ${\bm{w}}$ and ${\bm{v}}$ . Without loss of generality we may assume that $\langle{\bm{w}},{\bm{v}}\rangle>0$ , thus the error rate is exactly

$$\gamma=\frac{1}{4}\left\|\frac{1}{\sqrt{N}}\mathrm{sgn}({\bm{w}})-{\bm{v}}\right\|_{2}^{2}.$$

To estimate $\gamma$ , it suffices to bound the distance between ${\bm{u}},{\bm{v}}$ and ${\bm{W}}$ .
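As a small sanity check on this definition, the sketch below evaluates the error rate in the form $\gamma=\frac{1}{4}\|\frac{1}{\sqrt{N}}\mathrm{sgn}({\bm{w}})-{\bm{v}}\|_{2}^{2}$ , which is assumed here from the definitions of ${\bm{v}}$ and ${\bm{w}}$ in Section 2. Each misclassified point contributes $(2/\sqrt{N})^{2}/4=1/N$ , so the value equals the fraction of misclassified points. The function name `clustering_error` is hypothetical.

```python
import numpy as np

def clustering_error(w, v):
    """Error rate gamma = (1/4) * || sgn(w)/sqrt(N) - v ||_2^2, where v has
    entries +-1/sqrt(N). The sign of w is fixed so that <w, v> >= 0."""
    w = np.asarray(w, dtype=float)
    v = np.asarray(v, dtype=float)
    N = v.shape[0]
    if w @ v < 0:            # without loss of generality, <w, v> > 0
        w = -w
    s = np.where(w >= 0, 1.0, -1.0)
    return float(0.25 * np.sum((s / np.sqrt(N) - v) ** 2))

v = np.array([1, 1, -1, -1]) / 2.0       # ground truth for N = 4
w = np.array([0.3, -0.1, -0.2, -0.4])    # second point lands on the wrong side
print(clustering_error(w, v))            # one of four points misclassified
```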

By simple geometric consideration, we have

[TABLE]

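where $\overline{{\bm{P}}_{\bm{W}}{\bm{u}}}$ denotes the normalization of ${\bm{P}}_{\bm{W}}{\bm{u}}$ . Moreover, for any $\lambda,{\bm{x}}$ , we have

$$\|{\bm{A}}{\bm{x}}-\lambda{\bm{x}}\|_{2}\geq(\lambda-\lambda_{3}({\bm{A}}))\|{\bm{x}}-{\bm{P}}_{\bm{W}}{\bm{x}}\|_{2},$$

where $\lambda_{3}({\bm{A}})$ denotes the third largest eigenvalue of ${\bm{A}}$ .

Summing up, for $\lambda_{1},\lambda_{2}>\lambda_{3}({\bm{A}})$ ,

$$\gamma=\frac{1}{4}\left\|\frac{1}{\sqrt{N}}\mathrm{sgn}({\bm{w}})-{\bm{v}}\right\|_{2}^{2}\lesssim\frac{\|{\bm{A}}{\bm{u}}-\lambda_{1}{\bm{u}}\|_{2}^{2}}{(\lambda_{1}-\lambda_{3}({\bm{A}}))^{2}}+\frac{\|{\bm{A}}{\bm{v}}-\lambda_{2}{\bm{v}}\|_{2}^{2}}{(\lambda_{2}-\lambda_{3}({\bm{A}}))^{2}},$$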

Considering that ${\mathbb{E}}\langle{\bm{A}}{\bm{u}},{\bm{u}}\rangle=p(N/2-1)+qN/2$ , we expect $\lambda_{1}=p(N/2-1)+qN/2$ is a good choice. Similarly, choose $\lambda_{2}=p(N/2-1)-qN/2$ .
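The two expectations behind these choices follow from a short count. The derivation below is a sketch that uses only the definitions of ${\bm{u}},{\bm{v}},p,q$ from Section 2 and $A_{ii}=0$ ; the shorthand $s_{i}=\sqrt{N}v_{i}\in\{\pm 1\}$ is introduced here for convenience. Each point has $N/2-1$ other points in its own subspace and $N/2$ points in the other one.

```latex
\begin{aligned}
\mathbb{E}\langle A u, u\rangle
  &= \frac{1}{N}\sum_{i=1}^{N}\sum_{j\neq i}\mathbb{E}A_{ij}
   = \frac{1}{N}\sum_{i=1}^{N}\Big[\Big(\frac{N}{2}-1\Big)p+\frac{N}{2}\,q\Big]
   = p\Big(\frac{N}{2}-1\Big)+\frac{qN}{2},\\
\mathbb{E}\langle A v, v\rangle
  &= \frac{1}{N}\sum_{i=1}^{N}\sum_{j\neq i}s_{i}s_{j}\,\mathbb{E}A_{ij}
   = \frac{1}{N}\sum_{i=1}^{N}\Big[\Big(\frac{N}{2}-1\Big)p-\frac{N}{2}\,q\Big]
   = p\Big(\frac{N}{2}-1\Big)-\frac{qN}{2}.
\end{aligned}
```

In the second line, $s_{i}s_{j}=+1$ for pairs in the same subspace and $-1$ for pairs in different subspaces, which is what flips the sign of the $q$ term.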

From the above discussion, to estimate $\gamma$ we need to:

• Prove $\|{\bm{A}}{\bm{u}}-\lambda_{1}{\bm{u}}\|_{2}$ and $\|{\bm{A}}{\bm{v}}-\lambda_{2}{\bm{v}}\|_{2}$ are sufficiently small (see Lemma 3 and Lemma 4).

• Prove $\lambda_{1}-\lambda_{3}({\bm{A}})$ and $\lambda_{2}-\lambda_{3}({\bm{A}})$ are sufficiently large, which is equivalent to showing $p-q$ is large enough (see Lemma 3) and $\lambda_{3}({\bm{A}})$ is small enough (see Lemma 5).

Before proceeding, we analyze the adjacency matrix ${\bm{A}}$ based on the conditional independence property and provide the probability estimates used in the proof of Theorem 1. Specifically, this property states that, conditioned on ${\bm{x}}_{i},i\in\mathcal{S}$ for some subset $\mathcal{S}$ of $[N]$ , the entries $A_{ij}$ , $j\in\mathcal{S}^{c}$ , are functions of the respective ${\bm{x}}_{j}$ and are therefore independent of each other.

Moreover, recalling the definition of ${\bm{x}}_{i},{\bm{a}}_{i}$ , on the event $\mathcal{E}(t)$ given by the intersection of

$$\left\{\forall i,\ \big|\|{\bm{a}}_{i}\|-1\big|<t\right\},$$

$$\left\{\forall i,\ \left|\sum_{k}\lambda_{k}^{2}a_{ik}^{2}-\frac{\sum_{k}\lambda_{k}^{2}}{d}\right|<t\right\},$$

$$\left\{\forall i\neq j,\ |\langle{\bm{x}}_{i},{\bm{x}}_{j}\rangle|<t\right\},$$

conditioned on ${\bm{x}}_{i},i\in\mathcal{S}$ , the entries $A_{ij}$ , $j\in\mathcal{S}^{c}$ , are nearly identically distributed, and for a fixed $j\in\mathcal{S}^{c}$ , the entries $A_{ij}$ , $i\in\mathcal{S}$ , are nearly independent of each other; this will be explained and used many times in the following analysis. According to Lemma 7 and Lemma 8, there exist constants $c_{1},c_{2}>0$ such that

$${\mathbb{P}}\left(\mathcal{E}\left(c_{1}\sqrt{\frac{\log N}{d}}\right)\right)>1-\mathrm{e}^{-c_{2}\log N}.$$

For simplicity, we use $\mathcal{E}$ to denote $\mathcal{E}\left(c_{1}\sqrt{\frac{\log N}{d}}\right)$ . In this work, we always analyze the spectral method on the canonical event $\mathcal{E}$ .

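Let

$$\Phi(t):=\int_{|x|>t}\frac{1}{\sqrt{2\pi}}\mathrm{e}^{-\frac{x^{2}}{2}}\,\mathrm{d}x.$$

Then we have the following lemmas.

Lemma 1.

All $p_{i}$ are equal, and there exist constants $c_{1},c_{2}>0$ such that

$$p=p_{i}>\Phi(\tau_{d}(1+t))-\mathrm{e}^{-c_{2}\log N},$$

where $\tau_{d}:=\sqrt{d}\tau$ and $t=c_{1}\sqrt{\frac{\log N}{d}}$ .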

Proof.

Conditioned on ${\bm{x}}_{i}$ , for ${\bm{x}}_{j}\in S$ ,

$$p_{i}:={\mathbb{P}}\left(|\langle{\bm{x}}_{i},{\bm{x}}_{j}\rangle|\geq\tau\,\Big|\,{\bm{x}}_{i}\right)={\mathbb{P}}\left(|\langle\overline{{\bm{a}}}_{i},\overline{{\bm{a}}}_{j}\rangle|\geq\tau\,\Big|\,{\bm{a}}_{i}\right),$$

where ${\bm{a}}_{i},{\bm{a}}_{j}\overset{\mathrm{ind.}}{\sim}\mathcal{N}(\bm{0},\frac{1}{d}{\bm{I}}_{d})\in\mathbb{R}^{d}$ , and $\overline{{\bm{a}}}_{i},\overline{{\bm{a}}}_{j}$ denote the normalization with ${\bm{x}}_{i}={\bm{U}}\overline{{\bm{a}}}_{i},{\bm{x}}_{j}={\bm{U}}\overline{{\bm{a}}}_{j}$ . By the independence of ${\bm{a}}_{i},{\bm{a}}_{j}$ and the rotational invariance of Gaussian random vectors, it is clear that all $p_{i}$ are equal. Moreover, we have

[TABLE]

since $\langle\overline{{\bm{a}}}_{i},{\bm{a}}_{j}\rangle\sim\mathcal{N}(0,\frac{1}{d})$ is a Gaussian random variable independent of ${\bm{a}}_{i}$ , and

$${\mathbb{P}}\left(\|{\bm{a}}_{j}\|>1+t\right)<\mathrm{e}^{-c_{2}\log N}$$

according to Lemma 7. ∎
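The Gaussian approximation behind Lemma 1 can be illustrated numerically. The sketch below is an illustration under stated assumptions rather than part of the proof: it estimates the same-subspace connection probability by Monte Carlo and compares it with the two-sided Gaussian tail $\Phi(\tau_{d})$ , written via the complementary error function as $\mathrm{erfc}(t/\sqrt{2})$ . The function names and parameter values are hypothetical.

```python
import math
import numpy as np

def gaussian_two_sided_tail(t):
    """Phi(t) = P(|g| > t) for a standard Gaussian g, i.e. erfc(t / sqrt(2))."""
    return math.erfc(t / math.sqrt(2.0))

def empirical_same_subspace_rate(d, tau, trials=20000, seed=0):
    """Monte Carlo estimate of p = P(|<a_bar_i, a_bar_j>| >= tau) for
    independent points uniform on the unit sphere in R^d."""
    rng = np.random.default_rng(seed)
    a = rng.standard_normal((trials, d))
    b = rng.standard_normal((trials, d))
    a /= np.linalg.norm(a, axis=1, keepdims=True)
    b /= np.linalg.norm(b, axis=1, keepdims=True)
    return float(np.mean(np.abs(np.sum(a * b, axis=1)) >= tau))

d, tau = 100, 0.1                        # tau_d = sqrt(d) * tau = 1
p_hat = empirical_same_subspace_rate(d, tau)
print(p_hat, gaussian_two_sided_tail(math.sqrt(d) * tau))
```

For moderately large $d$ the two printed values are close, in line with the lemma's statement that $p$ is at least $\Phi(\tau_{d}(1+t))$ up to an exponentially small term.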

Lemma 2.

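There exist constants $c_{1},c_{2}>0$ , such that for $t=c_{1}\sqrt{\frac{\log N}{d}}$ , on $\mathcal{E}(t)$ , we have

$$\Phi\left(\frac{\tau_{d}(1+t)}{\sqrt{{\rm aff}^{2}-t}}\right)-\mathrm{e}^{-c_{2}\log N}<q_{i}<\Phi\left(\frac{\tau_{d}(1-t)}{\sqrt{{\rm aff}^{2}+t}}\right)+\mathrm{e}^{-c_{2}\log N},$$

where $\tau_{d}:=\sqrt{d}\tau$ .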

Proof.

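According to Remark 5 in Li and Gu (2017), we can choose ${\bm{U}}_{1},{\bm{U}}_{2}$ such that

$${\bm{U}}_{2}^{\top}{\bm{U}}_{1}=\left[\begin{array}{ccc}\lambda_{1}&&\\ &\ddots&\\ &&\lambda_{d}\end{array}\right].$$

Without loss of generality, assume that ${\bm{x}}_{i}\in S_{1},{\bm{x}}_{j}\in S_{2}$ ; then

$${\mathbb{P}}\left(|\langle{\bm{x}}_{i},{\bm{x}}_{j}\rangle|\geq\tau\,\Big|\,{\bm{x}}_{i}\right)={\mathbb{P}}\left(|\langle{\bm{U}}_{1}\overline{{\bm{a}}}_{i},{\bm{U}}_{2}\overline{{\bm{a}}}_{j}\rangle|\geq\tau\,\Big|\,{\bm{a}}_{i}\right),$$

where ${\bm{a}}_{i},{\bm{a}}_{j}\overset{\mathrm{ind.}}{\sim}\mathcal{N}(\bm{0},\frac{1}{d}{\bm{I}}_{d})\in\mathbb{R}^{d}$ , and $\overline{{\bm{a}}}_{i},\overline{{\bm{a}}}_{j}$ denote the normalization with ${\bm{x}}_{i}={\bm{U}}_{1}\overline{{\bm{a}}}_{i},{\bm{x}}_{j}={\bm{U}}_{2}\overline{{\bm{a}}}_{j}$ . In addition, the definition of $\mathcal{E}(t)$ gives

$$\left|\|{\bm{U}}_{2}^{\top}{\bm{U}}_{1}\overline{{\bm{a}}}_{i}\|^{2}-{\rm aff}^{2}\right|<t,$$

then according to Lemma 7,

[TABLE]

and similarly,

$$q_{i}<\Phi\left(\frac{\tau_{d}(1-t)}{\sqrt{{\rm aff}^{2}+t}}\right)+\mathrm{e}^{-c_{2}\log N}.$$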

∎

Specifically, according to the above two lemmas about $p_{i},q_{i}$ , we can easily get the following lemma.

Lemma 3.

Choose $\tau_{d}=O(1)$ , then $p=\Omega(1)$ . Moreover, on $\mathcal{E}$ , there exists some constant $c>0$ such that if $\kappa=1-{\rm aff}^{2}>c\sqrt{\frac{\log N}{d}}$ ,

$$p-q\gtrsim\kappa,$$

and

$$\frac{1}{N}\sum_{i}(q_{i}-q)^{2}\lesssim\frac{\log N}{d}.$$

Having finished the calculation about the probability of each entry, we now turn to the overall properties of ${\bm{A}}$ .

Lemma 4.

Conditioned on ${\bm{x}}_{i}$ , for any $t>0$ ,

$${\mathbb{P}}\left(\left|\frac{1}{N/2-1}\sum_{j:{\bm{x}}_{j}\in S}A_{ij}-p\right|>t\right)<\mathrm{e}^{-\frac{t^{2}(N/2-1)}{p+\frac{1}{3}t}},$$

and

[TABLE]

Proof.

Given ${\bm{x}}_{i}$ , it can be easily checked that the angles between ${\bm{x}}_{j}$ and ${\bm{x}}_{i}$ are independent of each other, so the $A_{ij}$ are conditionally independent Bernoulli random variables. Hence, the result follows from Lemma 9. ∎

In the next lemma, we will analyze the eigenvalue of ${\bm{A}}$ .

Lemma 5.

For $t=c_{1}\sqrt{\frac{\log N}{d}}$ , on $\mathcal{E}(t)$ , with probability at least $1-\mathrm{e}^{-c_{2}\log N}$ ,

[TABLE]

where $\lambda_{3}({\bm{A}})$ denotes the third largest eigenvalue of ${\bm{A}}$ , and $c,c_{1},c_{2}>0$ are some constants.

Proof.

We transfer the estimation of $\lambda_{3}({\bm{A}})$ to bounding $\lambda_{\max}\left({\bm{E}}\right)$ using Lemma 10, i.e.,

[TABLE]

where ${\bm{u}},{\bm{v}}$ are defined in Section 2, and

[TABLE]

then $E_{ij}=-p$ , if $i=j$ , $E_{ij}=A_{ij}-p$ , if ${\bm{x}}_{j}\in S$ and $E_{ij}=A_{ij}-q$ , if ${\bm{x}}_{j}\in S^{\prime}$ .

The analysis of $\lambda_{\max}\left({\bm{E}}\right)$ is based on the decoupling technique. According to Lemma 11, let $\mathcal{S}$ be a random subset of $[N]$ with average size $N/2$ , then

[TABLE]

where ${\bm{E}}_{\mathcal{S},\mathcal{S}^{c}}$ denotes the sub-matrix of ${\bm{E}}$ including the rows from $\mathcal{S}$ and columns from $\mathcal{S}^{c}$ , and $\|\cdot\|_{\mathrm{op}}$ denotes the operator norm.

To analyze $\|{\bm{E}}_{\mathcal{S},\mathcal{S}^{c}}\|_{\mathrm{op}}$ , we first condition on $\mathcal{S}$ and ${\bm{x}}_{j},j\in\mathcal{S}^{c}$ , and for $i\in\mathcal{S}$ , let $\Gamma_{i}:={\bm{E}}_{i,\mathcal{S}^{c}},{\bm{R}}_{i}:={\mathbb{E}}\Gamma_{i}^{\top}\Gamma_{i}$ , and $L:=\max_{i}\|\Gamma_{i}\|^{2}$ , then $\Gamma_{i}$ are independent with each other. On $\mathcal{E}$ ,

[TABLE]

Moreover, for the diagonal entries of ${\bm{R}}_{i}$ ,

[TABLE]

On the other hand, for the off-diagonal entries of ${\bm{R}}_{i}$ , if ${\bm{x}}_{j},{\bm{x}}_{k}\in S$ ,

[TABLE]

since $\langle{\bm{x}}_{j},{\bm{x}}_{k}\rangle\leq t$ . With similar analysis on the cases ${\bm{x}}_{j}\in S^{\prime},{\bm{x}}_{k}\in S$ and ${\bm{x}}_{j},{\bm{x}}_{k}\in S^{\prime}$ , we have the off-diagonal entries of ${\bm{R}}_{i}$ are less than $p^{2}t$ . Hence,

[TABLE]

and Lemma 12 gives, for $0<\theta<3/L$ ,

[TABLE]

Then

[TABLE]

Hence, with probability at least $1-\mathrm{e}^{-c_{2}\log N}$ ,

[TABLE]

Summing up,

[TABLE]

We conclude the proof. ∎

Now, we have all the ingredients for the proof of Theorem 1.

Proof of Theorem 1.

We begin with some inequalities for estimating the error. We have

[TABLE]

According to Lemma 4, for all $1\leq i\leq N$ , we have, with probability at least $1-\exp(-\Omega(\log N))$ ,

[TABLE]

and

[TABLE]

On the other hand, Lemma 3 gives, with probability at least $1-\exp(-\Omega(\log N))$ ,

[TABLE]

Summing up, we have, with probability at least $1-\exp(-\Omega(\log N))$ ,

[TABLE]

Similarly, with probability at least $1-\exp(-\Omega(\log N))$ ,

[TABLE]

According to Lemma 5, for $t=O\left(\sqrt{\frac{\log N}{d}}\right)$ , with probability at least $1-\exp(-\Omega(\log N))$ , the third largest eigenvalue of ${\bm{A}}$ satisfies

[TABLE]

With these estimations at hand, recall

[TABLE]

Lemma 3 gives $p\pm q\gtrsim 1-{\rm aff}^{2}$ , then we have

[TABLE]

We conclude the proof. ∎

5.2 Proof of Theorem 2

The robustness analysis follows the same method. We describe only the differences caused by the noise and omit the details.

We only need to track how Lemma 3, Lemma 4, and Lemma 5 change when noise is added. The noise terms do not destroy the conditional independence property, so all bounds except the estimate of $p-q$ still hold in a similar way. A simple calculation shows that the contribution of the noise has the form

[TABLE]

Taking this change into account yields the result.

6 Conclusion

This paper establishes a theory for analyzing the spectral method on Random Geometry Graphs constructed from data points in a Union of Subspaces. Based on this theory, we demonstrate the efficiency of Subspace Clustering in fairly broad conditions. To the best of our knowledge, the clustering accuracy has not been shown in the prior literature. The insights and analysis techniques developed in this paper might also have implications for other Random Geometry Graphs.

Moving forward, one issue is to understand UoS-RGGs constructed by more complex strategies, such as SSC. Additionally, one would ideally like exact recovery by the spectral method, which requires an entrywise analysis. We leave these to future investigation.

Appendix A Auxiliary Lemmas

In this appendix, we collect some well-known results on Gaussian and Bernoulli random variables and on random matrices (Vershynin, 2010), which are used to analyze the properties of the adjacency matrix ${\bm{A}}$. We omit most of the proofs.

Lemma 6 (Concentration in Gauss space (Ledoux, 2001)).

Let $f$ be a real-valued Lipschitz function on $\mathbb{R}^{n}$ with Lipschitz constant $K$, i.e.,

$$|f({\bm{x}}_{1})-f({\bm{x}}_{2})|\leq K\|{\bm{x}}_{1}-{\bm{x}}_{2}\|_{2}$$

for any ${\bm{x}}_{1},{\bm{x}}_{2}\in\mathbb{R}^{n}$ (such functions are also called $K$-Lipschitz). Let $X\sim\mathcal{N}(\bm{0},{\bm{I}}_{n})$ be the standard Gaussian random vector in $\mathbb{R}^{n}$. Then for every $t>0$, one has

[TABLE]

Lemma 7.

Assume ${\bm{a}}\sim\mathcal{N}(\bm{0},\frac{1}{d}{\bm{I}}_{d})\in\mathbb{R}^{d}$ , then for any $t>0$

[TABLE]

Moreover, for $0\leq\lambda_{1},\ldots,\lambda_{d}\leq 1$ and $t>0$

[TABLE]

Proof.

Let

[TABLE]

then by calculation

[TABLE]

Hence, $f({\bm{x}})$ is $1$-Lipschitz and according to Lemma 6, we have

[TABLE]

Take $f(x)=-\sqrt{\sum_{i}\lambda_{i}^{2}x_{i}^{2}}$ , then similarly

[TABLE]

Moreover, $\left({\mathbb{E}}\sqrt{\sum_{i}\lambda_{i}^{2}a_{i}^{2}}\right)^{2}\leq{\mathbb{E}}\sum_{i}\lambda_{i}^{2}a_{i}^{2}=\frac{\sum_{i}\lambda_{i}^{2}}{d}$ and

[TABLE]

Taking $\lambda_{i}=1$ proves (2), and squaring proves (3). ∎
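The concentration of $\|{\bm{a}}\|_{2}$ around $1$ for ${\bm{a}}\sim\mathcal{N}(\bm{0},\frac{1}{d}{\bm{I}}_{d})$ can be illustrated numerically. The sketch below is only a Monte Carlo sanity check, not part of the proof. The function name `norm_deviation_fraction`, the parameter values, and the reference tail $2\exp(-dt^{2}/2)$ are assumptions made for illustration: the reference is the usual Gaussian concentration rate for a $\frac{1}{\sqrt{d}}$-Lipschitz function around its mean, and need not match the constants of Lemma 7.

```python
import math
import random


def norm_deviation_fraction(d, t, trials, seed=0):
    """Fraction of draws a ~ N(0, I_d / d) with | ||a||_2 - 1 | > t."""
    rng = random.Random(seed)
    sigma = 1.0 / math.sqrt(d)  # each coordinate has variance 1/d
    exceed = 0
    for _ in range(trials):
        squared_norm = sum(rng.gauss(0.0, sigma) ** 2 for _ in range(d))
        if abs(math.sqrt(squared_norm) - 1.0) > t:
            exceed += 1
    return exceed / trials


# Illustrative parameters (assumptions, not taken from the paper).
d, t = 200, 0.2
empirical = norm_deviation_fraction(d, t, trials=2000)
reference = 2.0 * math.exp(-d * t * t / 2.0)  # yardstick only, see lead-in
print(f"empirical tail = {empirical:.4f}, reference tail = {reference:.4f}")
```

The empirical fraction of large deviations should fall well below the reference tail, reflecting that the norm fluctuates on the scale $1/\sqrt{d}$.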

Here, we also use $\langle{\bm{a}},{\bm{b}}\rangle$ to denote the angle between ${\bm{a}}$ and ${\bm{b}}$ .

Lemma 8 (Concentration of measure (Ledoux, 2001)).

Assume ${\bm{a}},{\bm{b}}\overset{\mathrm{ind.}}{\sim}\mathcal{N}(\bm{0},\frac{1}{d}{\bm{I}}_{d})\in\mathbb{R}^{d}$ , then for any $t>0$

[TABLE]

Lemma 9.

Assume $X_{1},X_{2},\ldots,X_{N}$ are generated independently from ${\rm Bern}(p)$. Then for any $t>0$

[TABLE]

Proof.

The conclusion follows directly from Bernstein’s inequality. ∎
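To make the Bernstein-type tail for Bernoulli sums concrete, the following sketch compares a simulated tail probability with the classical Bernstein bound $2\exp\left(-\frac{t^{2}/2}{Np(1-p)+t/3}\right)$ for $\left|\sum_{i}(X_{i}-p)\right|\geq t$. This classical form is an assumption: it is the textbook inequality for centered variables bounded by $1$, and the constants in Lemma 9 may differ. The function names and parameter values are likewise hypothetical.

```python
import math
import random


def bernoulli_tail(n, p, t, trials, seed=0):
    """Empirical P(|sum_i X_i - n p| >= t) for X_i i.i.d. Bern(p)."""
    rng = random.Random(seed)
    exceed = 0
    for _ in range(trials):
        total = sum(1 for _ in range(n) if rng.random() < p)
        if abs(total - n * p) >= t:
            exceed += 1
    return exceed / trials


def bernstein_bound(n, p, t):
    """Classical Bernstein tail for centered summands bounded by 1."""
    variance = n * p * (1.0 - p)
    return 2.0 * math.exp(-(t * t / 2.0) / (variance + t / 3.0))


# Illustrative parameters (assumptions, not taken from the paper).
n, p, t = 1000, 0.3, 40.0
tail = bernoulli_tail(n, p, t, trials=2000)
bound = bernstein_bound(n, p, t)
print(f"empirical tail = {tail:.4f}, Bernstein bound = {bound:.4f}")
```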

Lemma 10.

For any symmetric matrix ${\bm{M}}\in\mathbb{R}^{n\times n}$ ,

[TABLE]

where $S_{n-i}$ denotes a subspace of $\mathbb{R}^{n}$ of dimension $n-i$.

Proof.

This is a basic property of eigenvalues. ∎

We define a random subset $\mathcal{S}$ of $[N]$ with average size $\alpha N$ as follows. For all $i\in[N]$ , $i$ belongs to $\mathcal{S}$ with probability $\alpha$ independently from each other. Then we state an elementary decoupling lemma for double arrays here.

Lemma 11 (Decoupling (Helmers, 2000)).

Consider a double array of real numbers $(a_{ij})_{i,j=1}^{2N}$ such that $a_{ii}=0$ for all $i$ . Then

[TABLE]

where $\mathcal{S}$ is a random subset of $[N]$ with average size $N/2$ .
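The decoupling identity can be checked exactly on a small array by enumerating every subset. The sketch below assumes the standard form of the lemma, $\sum_{i\neq j}a_{ij}=4\,{\mathbb{E}}\sum_{i\in\mathcal{S},\,j\in\mathcal{S}^{c}}a_{ij}$, where each index joins $\mathcal{S}$ independently with probability $1/2$; the function names and the test array are hypothetical. The identity holds because every ordered pair $(i,j)$ with $i\neq j$ satisfies $i\in\mathcal{S},j\in\mathcal{S}^{c}$ with probability $1/4$.

```python
import itertools


def off_diagonal_sum(a):
    """Sum of a[i][j] over all ordered pairs with i != j."""
    n = len(a)
    return sum(a[i][j] for i in range(n) for j in range(n) if i != j)


def decoupled_expectation(a):
    """Exact E sum_{i in S, j not in S} a[i][j], S a uniformly random subset."""
    n = len(a)
    total = 0.0
    # Each membership pattern has probability 2^{-n} when every index is
    # included independently with probability 1/2.
    for in_subset in itertools.product([False, True], repeat=n):
        total += sum(
            a[i][j]
            for i in range(n) if in_subset[i]
            for j in range(n) if not in_subset[j]
        )
    return total / 2 ** n


# Small illustrative array with zero diagonal (an assumption for the demo).
size = 5
array = [[0 if i == j else (i + 1) * (j + 2) for j in range(size)] for i in range(size)]
lhs = off_diagonal_sum(array)
rhs = 4.0 * decoupled_expectation(array)
print(lhs, rhs)
```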

Lemma 12 (Matrix Bernstein: Mgf and Cgf Bound, Lemma 6.6.2 (Tropp et al., 2015)).

Suppose that ${\bm{X}}$ is a random Hermitian matrix that satisfies

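$${\mathbb{E}}{\bm{X}}=\bm{0}\quad\text{and}\quad\lambda_{\max}({\bm{X}})\leq L,$$

then for $0<\theta<3/L$

$${\mathbb{E}}\mathrm{e}^{\theta{\bm{X}}}\preceq\exp\left(\frac{\theta^{2}/2}{1-\theta L/3}\cdot{\mathbb{E}}{\bm{X}}^{2}\right)\quad\text{and}\quad\log{\mathbb{E}}\mathrm{e}^{\theta{\bm{X}}}\preceq\frac{\theta^{2}/2}{1-\theta L/3}\cdot{\mathbb{E}}{\bm{X}}^{2}.$$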
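A small numerical check can make the mgf bound tangible. The sketch below assumes the bound in the form stated in Lemma 6.6.2 of Tropp et al. (2015), ${\mathbb{E}}\mathrm{e}^{\theta{\bm{X}}}\preceq\exp\left(g(\theta)\,{\mathbb{E}}{\bm{X}}^{2}\right)$ with $g(\theta)=\frac{\theta^{2}/2}{1-\theta L/3}$, and tests it on a hypothetical toy example: ${\bm{X}}=\varepsilon{\bm{B}}$ with $\varepsilon$ a Rademacher sign and ${\bm{B}}$ a fixed symmetric $2\times 2$ matrix, so that ${\mathbb{E}}{\bm{X}}=\bm{0}$, ${\mathbb{E}}{\bm{X}}^{2}={\bm{B}}^{2}$, and $\lambda_{\max}({\bm{X}})\leq\|{\bm{B}}\|_{\mathrm{op}}=:L$. All helper names and the choice of ${\bm{B}}$ are assumptions for illustration.

```python
import math


def matmul(a, b):
    """Product of two 2x2 matrices given as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(2)) for j in range(2)] for i in range(2)]


def scale(m, c):
    return [[c * x for x in row] for row in m]


def add(a, b):
    return [[a[i][j] + b[i][j] for j in range(2)] for i in range(2)]


def expm(m, terms=40):
    """Matrix exponential by truncated Taylor series (fine for small norms)."""
    result = [[1.0, 0.0], [0.0, 1.0]]
    term = [[1.0, 0.0], [0.0, 1.0]]
    for k in range(1, terms):
        term = scale(matmul(term, m), 1.0 / k)  # term = m^k / k!
        result = add(result, term)
    return result


# Toy random matrix X = eps * B with eps = +1 or -1, each with probability 1/2.
B = [[1.0, 0.5], [0.5, -0.3]]
trace = B[0][0] + B[1][1]
det = B[0][0] * B[1][1] - B[0][1] * B[1][0]
root = math.sqrt(trace * trace - 4.0 * det)
L = max(abs((trace + root) / 2.0), abs((trace - root) / 2.0))  # operator norm of B

theta = 1.0 / L  # any value in (0, 3/L) is allowed
g = (theta ** 2 / 2.0) / (1.0 - theta * L / 3.0)

mgf = scale(add(expm(scale(B, theta)), expm(scale(B, -theta))), 0.5)  # E exp(theta X)
bound = expm(scale(matmul(B, B), g))  # exp(g * E X^2)
gap = add(bound, scale(mgf, -1.0))

# A symmetric 2x2 matrix is positive semidefinite iff its trace and determinant are >= 0.
gap_trace = gap[0][0] + gap[1][1]
gap_det = gap[0][0] * gap[1][1] - gap[0][1] * gap[1][0]
print(gap_trace, gap_det)
```

Both printed quantities being nonnegative means that the gap between the bound and the mgf is positive semidefinite for this example.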

Bibliography


1. Abbe (2017). E. Abbe. Community detection and stochastic block models: recent developments. Journal of Machine Learning Research, 18(1):6446–6531, 2017.
2. Abbe et al. (2017). E. Abbe, J. Fan, K. Wang, and Y. Zhong. Entrywise eigenvector analysis of random matrices with low expected rank. arXiv preprint arXiv:1709.09565, 2017.
3. Basri and Jacobs (2003). R. Basri and D. W. Jacobs. Lambertian reflectance and linear subspaces. IEEE Transactions on Pattern Analysis and Machine Intelligence, (2):218–233, 2003.
4. Chen et al. (2011). Y. Chen, N. M. Nasrabadi, and T. D. Tran. Hyperspectral image classification using dictionary-based sparse representation. IEEE Transactions on Geoscience and Remote Sensing, 49(10):3973–3985, 2011.
5. Chen et al. (2017). Y. Chen, G. Li, and Y. Gu. Active orthogonal matching pursuit for sparse subspace clustering. IEEE Signal Processing Letters, 25(2):164–168, 2017.
6. Chin et al. (2015). P. Chin, A. Rao, and V. Vu. Stochastic block model and community detection in sparse graphs: A spectral algorithm with optimal rate of recovery. In Conference on Learning Theory, pages 391–423, 2015.
7. Coja-Oghlan (2010). A. Coja-Oghlan. Graph partitioning via adaptive spectral techniques. Combinatorics, Probability and Computing, 19(2):227–284, 2010.
8. Costeira and Kanade (1998). J. P. Costeira and T. Kanade. A multibody factorization method for independently moving objects. International Journal of Computer Vision, 29(3):159–179, 1998.
