Clustering Degree-Corrected Stochastic Block Model with Outliers

Xin Qian; Yudong Chen; Andreea Minca

arXiv:1906.03305·cs.LG·June 11, 2019

TL;DR

This paper introduces a convex-optimization clustering algorithm for degree-corrected stochastic block models that effectively handles outliers, achieving exact recovery and lower error rates in heterogeneous networks.

Contribution

It presents a novel convex-optimization method with penalization for clustering in the presence of outliers, improving accuracy over existing algorithms.

Findings

01

Achieves exact cluster recovery under mild conditions.

02

Performs well on networks with Pareto degree distributions.

03

Reduces error rates compared to prior methods.

Abstract

For the degree corrected stochastic block model in the presence of arbitrary or even adversarial outliers, we develop a convex-optimization-based clustering algorithm that includes a penalization term depending on the positive deviation of a node from the expected number of edges to other inliers. We prove that under mild conditions, this method achieves exact recovery of the underlying clusters. Our synthetic experiments show that our algorithm performs well on heterogeneous networks, and in particular those with Pareto degree distributions, for which outliers have a broad range of possible degrees that may enhance their adversarial power. We also demonstrate that our method allows for recovery with significantly lower error rates compared to existing algorithms.

Equations

$$\bm{A}=\bm{P}\begin{bmatrix}\bm{K}&\bm{Z}\\ \bm{Z}^{T}&\bm{W}\end{bmatrix}\bm{P}^{T}=\bm{P}\begin{bmatrix}\bm{K}_{11}&\cdots&\bm{K}_{1r}&\bm{Z}_{1}\\ \vdots&\ddots&\vdots&\vdots\\ \bm{K}_{1r}^{T}&\cdots&\bm{K}_{rr}&\bm{Z}_{r}\\ \bm{Z}_{1}^{T}&\cdots&\bm{Z}_{r}^{T}&\bm{W}\end{bmatrix}\bm{P}^{T},\tag{2.1}$$

$$\bm{X}=\bm{P}\begin{bmatrix}\bm{J}_{l_{1}}&\bm{0}&\cdots&\bm{0}&*\\ \bm{0}&\bm{J}_{l_{2}}&\cdots&\bm{0}&*\\ \vdots&\vdots&\ddots&\vdots&\vdots\\ \bm{0}&\bm{0}&\cdots&\bm{J}_{l_{r}}&*\\ *&*&\cdots&*&*\end{bmatrix}\bm{P}^{T},\tag{2.2}$$

$$\min_{\bm{X}\in\mathbb{R}^{N\times N}}\ \langle\bm{X},\lambda\bm{d}\bm{d}^{T}-\bm{A}\rangle\quad\text{s.t.}\quad\bm{X}\text{ is a partition matrix}.\tag{3.1}$$

$$\min_{\bm{X}\in\mathbb{R}^{N\times N}}\ \langle\bm{X},\lambda\bm{d}\bm{d}^{T}-\bm{A}\rangle\quad\text{s.t.}\quad\bm{X}\succeq\bm{0},\quad\bm{0}\leq\bm{X}\leq\bm{J},\tag{3.2}$$

$$\min_{\bm{X}\in\mathbb{R}^{N\times N}}\ \langle\bm{X},\alpha\bm{I}+\lambda\bm{J}-\bm{A}\rangle\quad\text{s.t.}\quad\bm{X}\succeq\bm{0},\quad\bm{0}\leq\bm{X}\leq\bm{J}.\tag{3.3}$$

$$G_{a}:=\sum_{i\in C_{a}^{*}}\theta_{i}.$$

$$H_{a}:=\sum_{b=1}^{r}G_{b}B_{ab}.$$

$$\min_{\bm{X}}\ \langle\bm{X},\alpha\,{\rm diag}(\bm{d}^{*})+\lambda\bm{d}\bm{d}^{T}-\bm{A}\rangle\quad\text{s.t.}\quad\bm{X}\succeq\bm{0},\quad\bm{0}\leq\bm{X}\leq\bm{J}.\tag{3.4}$$

$$G_{a}\geq l_{\min}\bar{\theta}_{\min}\quad\text{and}\quad n\bar{\theta}q^{-}\leq H_{a}\leq n\bar{\theta}p^{+}$$

$$\delta\geq c_{0}\left\{\frac{p^{+}\log n}{\theta_{\min}G_{\min}}+\frac{\alpha n\bar{\theta}p^{+}}{G_{\min}}+\frac{\theta_{\max}p^{+}n\log n}{G_{\min}\theta_{\min}}+\frac{\log n}{G_{\min}\theta_{\min}}+\frac{mr}{\theta_{\min}G_{\min}}+\frac{m}{\alpha\theta_{\min}G_{\min}}\right\}\tag{4.1}$$

$$\max_{1\leq a<b\leq r}\frac{B_{ab}+\delta}{H_{a}H_{b}}<\lambda<\min_{1\leq a\leq r}\frac{B_{aa}-\delta}{H_{a}^{2}}\tag{4.2}$$

$$\alpha\geq c_{1}\frac{m}{H^{-}},\tag{4.3}$$

$$\bm{\widehat{X}}=\bm{P}\begin{bmatrix}\bm{J}_{l_{1}}&&&\bm{Z}_{1}\\ &\ddots&&\vdots\\ &&\bm{J}_{l_{r}}&\bm{Z}_{r}\\ \bm{Z}_{1}^{T}&\cdots&\bm{Z}_{r}^{T}&\bm{W}\end{bmatrix}\bm{P}^{T},$$

$$\delta\geq c_{0}\left\{\frac{p\log n}{l_{\min}}+\frac{\alpha np}{l_{\min}}+\frac{nq\log n}{l_{\min}}+\frac{mr}{l_{\min}}+\frac{m}{\alpha l_{\min}}\right\},$$

$$\frac{q+\delta}{f^{2}}<\lambda<\frac{p-\delta}{f^{2}},$$

$$\alpha\geq c_{1}\frac{m}{f},$$

$$\delta\geq c_{0}\left\{\frac{p\log n}{\theta_{\min}G_{\min}}+\frac{qn\log n}{G_{\min}}\cdot\frac{\theta_{\max}}{\theta_{\min}}\right\},$$

$$\max_{1\leq a<b\leq r}\frac{q+\delta}{H_{a}H_{b}}<\lambda<\min_{1\leq a\leq r}\frac{p-\delta}{H_{a}^{2}}.$$

$$\bm{A}=\bm{P}\begin{bmatrix}\bm{K}&\bm{Z}\\ \bm{Z}^{T}&\bm{W}\end{bmatrix}\bm{P}^{T}.$$

$$d_{i}\leq 4\theta_{i}H^{+}.$$

$$\bm{E}=\begin{bmatrix}\alpha\,{\rm diag}(\bm{d}_{(1)}^{*})+\lambda\bm{d}_{(1)}\bm{d}_{(1)}^{T}-\bm{K}_{11}&\cdots&\lambda\bm{d}_{(1)}\bm{d}_{(r)}^{T}-\bm{K}_{1r}&\bm{Z}_{1}\\ \vdots&\ddots&\vdots&\vdots\\ \lambda\bm{d}_{(r)}\bm{d}_{(1)}^{T}-\bm{K}_{1r}^{T}&\cdots&\alpha\,{\rm diag}(\bm{d}_{(r)}^{*})+\lambda\bm{d}_{(r)}\bm{d}_{(r)}^{T}-\bm{K}_{rr}&\bm{Z}_{r}\\ \bm{Z}_{1}^{T}&\cdots&\bm{Z}_{r}^{T}&\bm{W}\end{bmatrix}.$$

$$\min\ \cdots\quad\text{s.t.}\quad\bm{x}_{a}\geq\bm{0}\ \text{ for }1\leq a\leq r,\qquad\sum_{a=1}^{r}\bm{x}_{a}^{T}(\bm{e}_{a}\bm{e}_{j}^{T})\bm{x}_{a}\leq 1\ \text{ for }1\leq j\leq m,$$

$$\bm{\Xi}={\rm diag}\{\xi_{1},\dots,\xi_{r}\}$$

$$\bm{W}\bm{x}_{a}+\bm{Z}_{a}^{T}\bm{1}_{l_{a}}$$

$$\xi_{j}\Big(1-\sum_{a=1}^{r}\bm{x}_{a}^{T}(\bm{e}_{a}\bm{e}_{j}^{T})\bm{x}_{a}\Big)$$

$$\langle\bm{x}_{a},\bm{\beta}_{a}\rangle$$

Taxonomy

Topics: Anomaly Detection Techniques and Applications · Complex Network Analysis Techniques · Adversarial Robustness in Machine Learning

Full text

Xin Qian (Northwestern University, Industrial Engineering and Management Sciences, Ithaca, NY, 14850, USA); Yudong Chen and Andreea Minca (School of Operations Research and Information Engineering, Ithaca, NY, 14850, USA).

1 Introduction

Clustering nodes in a complex network is one of the major challenges in network science. Various aspects of this problem have been studied by researchers across different fields including computer science, statistics, operations research, probability and physics; for a partial list of work in this line, see Airoldi et al. (2008); Bickel and Chen (2009); Bordenave et al. (2018); Chen et al. (2014b); Clauset et al. (2004); Condon and Karp (2001); Dasgupta et al. (2004); Demaine et al. (2006); Fortunato and Barthelemy (2007); Fortunato (2010); Guédon and Vershynin (2015); Lei et al. (2015); Massoulié (2014); Newman and Girvan (2004); Nowicki and Snijders (2001); Rohe et al. (2011); Shamir and Tishby (2011); Yi et al. (2012); Zhang et al. (2014).

A variety of clustering algorithms have been developed, such as modularity maximization (Newman, 2006; Karrer and Newman, 2011), graph-cut based methods (Condon and Karp, 2001; Bollobás and Scott, 2004), model/likelihood-based methods (Zhao et al., 2012; Le et al., 2016), spectral clustering (McSherry, 2001; Shi and Malik, 2000; Ng et al., 2002), hierarchical clustering methods (Balcan and Gupta, 2010), and more recently algorithms for dynamic networks (Hajek and Sankagiri, 2018).

A main goal and driving force for the development of new algorithms is to tackle features of real-world datasets, among them the size and heterogeneity of the networks. In many cases, promising algorithms remain heuristic and are not yet amenable to rigorous performance analysis. On the theory side, obtaining provable performance guarantees is hindered by the fact that each realistic feature added to the network model significantly increases the complexity of the analysis.

A recent line of research on clustering makes use of convex optimization to achieve both computational efficiency and statistical quality guarantees (Chen et al., 2014a, b; Ames and Vavasis, 2011; Cai and Li, 2015; Guédon and Vershynin, 2015). By relaxing the original combinatorial problem into a semidefinite program (SDP), tractable clustering algorithms are developed which, under very general statistical settings, provably produce a high quality clustering. With a few exceptions such as Dasgupta et al. (2004); Karrer and Newman (2011); Chen et al. (2018), most previous work in this area considers homogeneous networks, in which different nodes exhibit similar statistical properties.

In this paper, we focus on designing clustering algorithms applicable to heterogeneous networks with outliers, and on providing theoretical guarantees for them. In particular, we would like to simultaneously capture the following three key features common in real world networks:

•

Clustering/community structure: Nodes may belong to several different groups, where nodes in the same group are more likely to connect to each other than those in different groups.

•

Heterogeneous degrees: It is well-documented that real-world networks have heterogeneous node degrees. In particular, the degrees of nodes, even those in the same cluster, may exhibit heavy-tailed distribution (Newman, 2010).

•

Arbitrary outliers: There may exist a set of nodes that do not belong to any clusters and have arbitrary, even adversarial connection patterns.

Many networks have these properties. For example, the political blogs network (Adamic and Glance, 2005) contains blogs that are mostly Democratic- or Republican-oriented, some of which have significantly more followers than others, but there are also blogs associated with neither political group. Another important example is given by financial networks. These consist of at most thousands of nodes but exhibit substantial heterogeneity. More importantly, the number of clusters is known and small. Clusters represent classical investment strategies, while some of the firms are multi-strategy and can be thought of as outliers (Guo et al., 2016). Arbitrary outliers could also correspond to non-bank firms connected in financial derivatives markets (Peltonen et al., 2014). Last and probably most important, in genomics there are a variety of co-expression networks that suffer from the presence of outliers. For example, cancer-type-specific co-expression networks are medium-sized networks of 10-20 thousand genes, and network analysis and clustering can be useful for identifying prognostic genes for some types of cancer (Yang et al., 2014).

Note that the outliers may be different in nature among themselves, so one cannot simply treat them as an additional cluster and apply a standard clustering algorithm. Indeed, many existing methods, such as spectral clustering, are known to perform poorly in the presence of outliers even in small datasets (Cai and Li, 2015).

1.1 Our Contributions

Motivated by the considerations above, we consider a network model that accounts for the combined features. The clustering algorithm we consider is based on Semidefinite Programming (SDP) relaxation of the Modularity Maximization approach. We introduce a novel regularization term that penalizes outliers with unusual connection patterns beyond those implied by the inlier heterogeneity.

Two existing works serve as the foundation of our analysis: the Stochastic Block Model (SBM) with Outliers in Cai and Li (2015) and the Degree-Corrected Stochastic Block Model (DCSBM) in Chen et al. (2018). As usual, the complexities arising from combining multiple realistic features far exceed those of handling any one of these features alone. In particular, our analysis needs to address the following challenges:

(1) When inliers are homogeneous, a node with unusual degree can be immediately recognized as an outlier. Therefore, for an outlier to hide, it must have a degree that is similar to inliers’ degree. This limits significantly the power of the adversary. In real networks, however, a node with unusual degree may not be an outlier – it can well belong to one of the clusters, and have very high or low degree simply because this node is more popular/unpopular than other nodes in the same cluster. In this sense, we are facing the more challenging problem where the outliers have more freedom and do not need to restrict their degrees. The ability of the outliers to select across a broad range of degrees (especially when the inliers’ degree distribution is heavy-tailed) makes it crucial for an algorithm to look into the detailed connectivity patterns of the nodes (who connects to whom) rather than just their degrees (how many they connect to), even more so than SBM.

(2) We use the primal-dual witness approach. However, in proving the necessary bounds for the recovered solution, we need to bound separately the contributions from nodes with different degrees. The heterogeneous nature of the nodes’ connectivity complicates the analysis of the distribution of edges. We need to obtain sufficiently tight individual bounds and ensure the correct dependence on the individual degrees so that the aggregate bound is sufficiently strong. In contrast, a worst-case bound in terms of the maximum degree would be too loose. Moreover, in the degree-corrected setup, the definiteness of the adjacency matrix becomes worse, and a homogeneous penalization of the diagonal terms is not enough to recover the true clusters. We instead introduce a term that depends on the degree of each node: it takes the form $\alpha\text{diag}\{d^{*}\}$, where $d^{*}$ depends both on the degree vector $d$ and on a control on the expected number of edges to other inliers.

By addressing these points, we provide theoretical guarantees for the exact recovery of the inlier nodes with high probability. In particular, we impose no assumptions on the outlier nodes other than their cardinality. We provide an explicit and non-asymptotic condition for exact recovery. Namely, we require that the density gap (the difference between the intra- and inter-cluster edge densities) be larger than an expression based on the natural problem parameters, such as the edge densities, the amount of degree heterogeneity, the sizes of the clusters, and the numbers of inliers and outliers. We also give explicit conditions on the tuning parameters of the algorithm. Surprisingly, the condition for recovery does not contain an “$nm$” term as in Cai and Li (2015) and instead contains two terms, in $n$ and $\sqrt{n\log n}$ respectively.

Our model is applicable to networks with several thousand nodes. These are medium-sized networks, which may arise in various applications and are subject to the real-world features described above. We provide numerical results based on synthetic data for networks of size ranging from $n=400$ to $n=1000$, divided into $r=2$ clusters and in the presence of a varying number of outliers, $m\in[10,30]$. The degrees follow a heavy-tailed Pareto distribution, with varying shape parameters. We compare the misclassification rate of our algorithm to that of state-of-the-art algorithms, such as spectral clustering (Zhang et al., 2014), SCORE (Jin, 2015) and Cai-Li (Cai and Li, 2015). Our results significantly improve the quality of recovery. Notably, the performance of the algorithm is relatively unhindered even under very heterogeneous degree distributions and in the presence of a large number of outliers. In contrast, the other algorithms show a sharp increase in the misclassification rate in such settings.

1.2 Notation

Matrices are denoted by bold capital letters, vectors by bold lower-case letters, and scalars by normal letters. The notation $\bm{X}\succeq\bm{0}$ means the matrix $\bm{X}$ is positive semidefinite (psd). For two matrices $\bm{X}$ and $\bm{Y}$ of the same dimension, we denote their trace inner product by $\langle\bm{X},\bm{Y}\rangle:=\text{Trace}(\bm{X}^{T}\bm{Y})$, and we write $\bm{X}\leq\bm{Y}$ if $X_{ij}\leq Y_{ij}$ for all $i$ and $j$. For an integer $k$, let $[k]:=\{1,2,\ldots,k\}.$ We use $\bm{I}$ to denote the identity matrix, $\bm{J}$ the matrix with all entries equal to 1, and ${\rm diag}(\bm{u})$ the diagonal matrix whose $i$-th diagonal entry is $u_{i}$. We use notations like $c,c_{0},C$ etc. to denote numerical constants independent of the other model parameters (in particular, the number of nodes $n$). Finally, for two quantities $x\equiv x_{n}$ and $y\equiv y_{n}$ that may depend on $n$, we write $x\asymp y$ if they are of the same order, that is, there exist numerical constants $c_{1}$ and $c_{2}$ such that $c_{1}y\leq x\leq c_{2}y$.

2 Problem Setup

We consider a Degree-Corrected Stochastic Block Model with Outliers, which is a generative model for a random graph with underlying clustering structures.

In particular, the model involves a graph $\mathcal{G}=({V},\bm{A})$. Here ${V}=[N]=[n+m]$ is a set of $N:=n+m$ vertices, where $n$ inliers are partitioned into $r$ unknown clusters $C_{1}^{*},C_{2}^{*},\cdots,C_{r}^{*}$, and the other $m$ nodes are outliers. The adjacency matrix $\bm{A}\in\{0,1\}^{(n+m)\times(n+m)}$, where ${A}_{ij}=1$ if and only if nodes $i$ and $j$ are connected, is generated randomly as follows. Each pair of distinct inliers $i\in C_{a}^{*}$ and $j\in C_{b}^{*}$ is connected by an undirected edge with probability $\theta_{i}\theta_{j}B_{ab}$, independently of all others. Here the vector $\bm{\theta}=({\theta}_{1},{\theta}_{2},\cdots,{\theta}_{n})^{\top}\in\mathbb{R}_{+}^{n}$ collects the degree heterogeneity parameters of the nodes. The symmetric matrix $\bm{B}=(B_{ab})\in\mathbb{R}_{+}^{r\times r}$ is called the connectivity matrix of the clusters, and specifies the likelihood of connectivity of the inliers. The connections of the $m$ outliers among each other and to the inliers are arbitrary; they may depend on the underlying clusters and the realization of edges between the inliers, and may even be chosen adversarially.

Note that the above model is a generalization of several well-known models. When there are no outliers ( $m=0$ ) and $\theta_{i}\equiv 1$ is uniform, the model reduces to the classical SBM (Holland et al., 1983). If $m=0$ but $\theta_{i}$ is allowed to vary across $i$ , it becomes the standard DCSBM (Dasgupta et al., 2004; Karrer and Newman, 2011). Finally, when $\theta_{i}\equiv 1$ but $m$ may be non-zero, it coincides with the setting considered in Cai and Li (2015), i.e., the SBM with outliers.
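To make the generative rule for the inliers concrete, the following is a minimal sketch (not the authors' code) that samples the inlier block of the adjacency matrix. The function name `sample_dcsbm_inliers`, the toy values of `B`, and the cap of the edge probability at 1 are assumptions made for illustration only.

```python
import random

def sample_dcsbm_inliers(labels, theta, B, seed=0):
    """Sample the symmetric 0-1 inlier block of a DCSBM.

    labels[i] is the cluster index of inlier i, theta[i] its degree
    heterogeneity parameter, and B the r-by-r connectivity matrix.
    """
    rng = random.Random(seed)
    n = len(labels)
    K = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            # Edge probability theta_i * theta_j * B_ab (capped at 1 as a safeguard).
            prob = min(1.0, theta[i] * theta[j] * B[labels[i]][labels[j]])
            if rng.random() < prob:
                K[i][j] = K[j][i] = 1  # undirected edge, no self-loops
    return K

labels = [0] * 5 + [1] * 5
theta = [1.0] * 10            # theta_i = 1 for all i gives the classical SBM
B = [[0.8, 0.1], [0.1, 0.8]]  # illustrative values, not taken from the paper
K = sample_dcsbm_inliers(labels, theta, B)
```

Setting all entries of `theta` to 1, as in this toy call, corresponds to the SBM special case mentioned above; varying `theta` across nodes gives the degree-corrected model.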

For future development, it is convenient to write the adjacency matrix $A$ in a block form according to the clustering structure

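$$\bm{A}=\bm{P}\begin{bmatrix}\bm{K}&\bm{Z}\\ \bm{Z}^{T}&\bm{W}\end{bmatrix}\bm{P}^{T}=\bm{P}\begin{bmatrix}\bm{K}_{11}&\cdots&\bm{K}_{1r}&\bm{Z}_{1}\\ \vdots&\ddots&\vdots&\vdots\\ \bm{K}_{1r}^{T}&\cdots&\bm{K}_{rr}&\bm{Z}_{r}\\ \bm{Z}_{1}^{T}&\cdots&\bm{Z}_{r}^{T}&\bm{W}\end{bmatrix}\bm{P}^{T},\tag{2.1}$$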

where the block matrices above have the following interpretations:

•

$\bm{W}\in\{0,1\}^{m\times m}$ is a symmetric 0-1 matrix representing the connection within the outliers. Under our model, $\bm{W}$ is arbitrary.

•

$\bm{Z}\in\{0,1\}^{n\times m}$ is a 0-1 matrix representing the connection between inliers and outliers; in particular, $\bm{Z_{a}}$ is the adjacency matrix between the outliers and the $a$ -th inlier cluster. Under our model, $\bm{Z}$ is arbitrary.

•

$\bm{K}\in\{0,1\}^{n\times n}$ is a symmetric 0-1 matrix representing the connection between inliers. In particular, $\bm{K_{ab}}$ is the adjacency matrix between the $a$ -th and $b$ -th clusters. Under our model, each entry of $\bm{K_{ab}}$ is equal to 1 with probability $\theta_{i}\theta_{j}B_{ab}$ , independently of all others.

•

$\bm{P}\in\{0,1\}^{N\times N}$ is an unknown permutation matrix, in which there is a single 1 in each row and column while all other entries are 0. Under this permutation, the nodes are ordered according to the underlying structure of clusters and outliers.

For each $a\in[r]$, we denote the size of the $a$-th cluster by $l_{a}:=|C^{*}_{a}|$. Note that $n=\sum_{a=1}^{r}l_{a}$. Let $l_{\min}=\min_{1\leq a\leq r}l_{a}$ be the minimum size of the clusters. We also introduce the vector of node degrees $\bm{d}=(d_{1},\cdots,d_{n+m})^{\top}$, where $d_{i}:=\sum\limits_{j=1}^{n+m}A_{ij}$.

For each candidate partition of the $(n+m)$ nodes into several clusters, we may associate it with a partition matrix $\bm{X}\in\{0,1\}^{(n+m)\times(n+m)}$, such that $X_{ij}=1$ if and only if nodes $i$ and $j$ are assigned to the same cluster, with the convention that $X_{ii}=1$. Ideally, we would like to find a partition matrix of the form

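$$\bm{X}=\bm{P}\begin{bmatrix}\bm{J}_{l_{1}}&\bm{0}&\cdots&\bm{0}&*\\ \bm{0}&\bm{J}_{l_{2}}&\cdots&\bm{0}&*\\ \vdots&\vdots&\ddots&\vdots&\vdots\\ \bm{0}&\bm{0}&\cdots&\bm{J}_{l_{r}}&*\\ *&*&\cdots&*&*\end{bmatrix}\bm{P}^{T},\tag{2.2}$$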

where $\bm{J}_{l}$ denotes the $l$ -by- $l$ all-one matrix, and $*$ denotes arbitrary entries. In other words, we want to correctly recover the cluster structure within the inliers, where cluster assignment of the outliers may be arbitrary.

Given a single realization of the resulting random graph $\mathcal{G}=(V,\bm{A})$ , our goal is to recover the true inlier clusters $\left\{C_{a}^{*}\right\}_{a=1}^{r}$ , that is, to recover a partition matrix in the form of (2.2).
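The recovery goal can be illustrated with a small sketch. It assumes, for simplicity, that the nodes are already ordered with the inliers first (that is, $\bm{P}$ is the identity); the helper names `partition_matrix` and `inlier_blocks_match` are hypothetical.

```python
def partition_matrix(labels):
    """Return X with X[i][j] = 1 iff nodes i and j share a label (so X[i][i] = 1)."""
    n = len(labels)
    return [[1 if labels[i] == labels[j] else 0 for j in range(n)] for i in range(n)]

def inlier_blocks_match(X, true_labels):
    """Check that X restricted to the inliers equals the ideal block pattern.

    Only the first len(true_labels) rows and columns are compared; the outlier
    rows and columns (the '*' entries) are ignored.
    """
    n = len(true_labels)
    ideal = partition_matrix(true_labels)
    return all(X[i][j] == ideal[i][j] for i in range(n) for j in range(n))

# Two inlier clusters of sizes 2 and 3, followed by one outlier (toy data).
true_labels = [0, 0, 1, 1, 1]
candidate = partition_matrix([0, 0, 1, 1, 1, 0])  # the outlier is put in cluster 0
print(inlier_blocks_match(candidate, true_labels))  # True
```

The check succeeds regardless of where the outlier is placed, which mirrors the fact that the entries marked $*$ are unconstrained.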

3 Algorithm: A Convex Relaxation Approach

In this section, we provide the motivation and description of our algorithm, which can handle both degree heterogeneity and outliers.

We begin by recalling that a classical approach to clustering nodes in a network is modularity maximization (Newman, 2006), which involves solving the optimization problem

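$$\min_{\bm{X}\in\mathbb{R}^{N\times N}}\ \langle\bm{X},\lambda\bm{d}\bm{d}^{T}-\bm{A}\rangle\quad\text{s.t.}\quad\bm{X}\text{ is a partition matrix}.\tag{3.1}$$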

The negative of the objective of the above optimization problem is called modularity, which is a measure of the quality of the candidate clustering $\bm{X}$ . This quality measure is derived and studied in depth in the work of Newman (2006), which shows that maximizing the modularity provides a natural and robust framework for finding a good clustering of the nodes.
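As a small numerical illustration, the sketch below evaluates the minimization objective for two candidate partitions of a toy graph. It assumes the objective has the form $\langle\bm{X},\lambda\bm{d}\bm{d}^{T}-\bm{A}\rangle$ with a tuning parameter $\lambda$; the function names and the value of `lam` are illustrative choices.

```python
def same_cluster_matrix(labels):
    """Partition matrix of a labeling: entry (i, j) is 1 iff labels agree."""
    return [[int(a == b) for b in labels] for a in labels]

def modularity_objective(A, X, lam):
    """Return <X, lam * d d^T - A>; its negative is the modularity of X."""
    n = len(A)
    d = [sum(row) for row in A]  # node degrees
    return sum(X[i][j] * (lam * d[i] * d[j] - A[i][j])
               for i in range(n) for j in range(n))

# Toy graph with two disjoint edges: 0-1 and 2-3.
A = [[0, 1, 0, 0], [1, 0, 0, 0], [0, 0, 0, 1], [0, 0, 1, 0]]
good = modularity_objective(A, same_cluster_matrix([0, 0, 1, 1]), lam=0.25)
bad = modularity_objective(A, same_cluster_matrix([0, 1, 0, 1]), lam=0.25)
print(good, bad)  # -2.0 2.0
```

The partition that groups connected nodes together attains the smaller objective value, that is, the larger modularity.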

In general, the modularity optimization problem (3.1) is intractable due to the need to search over partition matrices, a non-convex and combinatorial constraint. Replacing this constraint with a convex constraint, Chen et al. (2018) propose the following convex SDP relaxation of modularity maximization:

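$$\min_{\bm{X}\in\mathbb{R}^{N\times N}}\ \langle\bm{X},\lambda\bm{d}\bm{d}^{T}-\bm{A}\rangle\quad\text{s.t.}\quad\bm{X}\succeq\bm{0},\quad\bm{0}\leq\bm{X}\leq\bm{J},\tag{3.2}$$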

where we recall that $\bm{J}$ is the $n\times n$ all-one matrix. They provide recovery guarantees for the above convex relaxation under DCSBM without outliers. On the other hand, to handle outliers in the classical SBM setting, Cai and Li (2015) propose a convex relaxation formulation that penalizes the diagonal entries of $X$ :

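$$\min_{\bm{X}\in\mathbb{R}^{N\times N}}\ \langle\bm{X},\alpha\bm{I}+\lambda\bm{J}-\bm{A}\rangle\quad\text{s.t.}\quad\bm{X}\succeq\bm{0},\quad\bm{0}\leq\bm{X}\leq\bm{J}.\tag{3.3}$$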

This formulation, however, is unable to handle DCSBM as it treats all nodes equally without considering the variation in their degrees.

Our Algorithm: Building on the formulations (3.2) and (3.3), we propose a convex relaxation formulation that accounts for both degree heterogeneity and outliers. Given that outliers can have any degree, we need to incorporate a larger penalization term on the diagonal entries of $\bm{X}$. In particular, we penalize a potential outlier whose degree exhibits unusual behavior beyond the normal variation implied by the DCSBM.

To be more specific, our algorithm depends on several quantities of the model. For each true cluster $C_{a}^{*}$ , we define the aggregate degree heterogeneity parameter as

$$G_{a}:=\sum_{i\in C_{a}^{*}}\theta_{i}.$$

Consequently, the expected number of edges from each node $i$ to other inliers is equal to $\theta_{i}H_{a}$ , where

$$H_{a}:=\sum_{b=1}^{r}G_{b}B_{ab}.$$

Our diagonal penalization term is based on the quantity $d_{i}^{*}:=\max\left\{d_{i},H^{+}\right\}$ , where $H^{+}:=\max_{1\leq a\leq r}H_{a}$ . With the above notation, we consider the following convex relaxation formulation

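$$\min_{\bm{X}}\ \langle\bm{X},\alpha\,{\rm diag}(\bm{d}^{*})+\lambda\bm{d}\bm{d}^{T}-\bm{A}\rangle\quad\text{s.t.}\quad\bm{X}\succeq\bm{0},\quad\bm{0}\leq\bm{X}\leq\bm{J}.\tag{3.4}$$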

One can see that our formulation is a convex relaxation of modularity maximization with an additional node-dependent regularization term on the diagonal entries of $\bm{X}$ . In particular, we penalize each node $i$ differently with the weight $d_{i}^{*}$ , which is an upper bound of the node’s degree $d_{i}$ that also captures the positive deviation from the expected connections to other inliers. The tuning parameter $\alpha$ controls the strength of this diagonal penalization, and should be chosen to be sufficiently large. In our theoretical results in the next section, we provide guidance on how to choose $\alpha$ ; in particular, we need $\alpha\geq c_{1}\frac{m}{H^{-}}$ , where $H^{-}:=\min_{1\leq a\leq r}H_{a}$ and $c_{1}$ is a numerical constant.
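The node-dependent penalty can be sketched as follows. The code assembles the cost matrix $\alpha\,{\rm diag}(\bm{d}^{*})+\lambda\bm{d}\bm{d}^{T}-\bm{A}$, assuming this is the matrix paired with $\bm{X}$ in the objective; it does not solve the semidefinite program, which would require an SDP solver. The function name and the toy inputs are hypothetical, and $H^{+}$ is passed in as a known quantity even though it depends on the model parameters.

```python
def penalized_cost_matrix(A, H_plus, alpha, lam):
    """Build alpha * diag(d*) + lam * d d^T - A with d*_i = max(d_i, H_plus)."""
    n = len(A)
    d = [sum(row) for row in A]             # observed degrees
    d_star = [max(di, H_plus) for di in d]  # degree-dependent diagonal weights
    C = [[lam * d[i] * d[j] - A[i][j] for j in range(n)] for i in range(n)]
    for i in range(n):
        C[i][i] += alpha * d_star[i]        # diagonal penalty for node i
    return C, d_star

# Star graph: node 0 is connected to nodes 1, 2 and 3 (toy data).
A = [[0, 1, 1, 1], [1, 0, 0, 0], [1, 0, 0, 0], [1, 0, 0, 0]]
C, d_star = penalized_cost_matrix(A, H_plus=2.0, alpha=0.5, lam=0.1)
```

In this toy call the high-degree node keeps its own degree as weight ($d^{*}_{0}=3$), while the low-degree nodes are weighted by $H^{+}=2$, so every node receives a penalty of at least $\alpha H^{+}$.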

4 Theoretical Guarantees

In this section, we provide theoretical guarantees on the performance of the convex optimization approach (3.4) under the setting of DCSBM with Outliers described in Section 2. Before stating our main theorem, we introduce several quantities of interest, and record some useful relationships between them.

4.1 Additional Notations and Preliminary Facts

We first provide a summary of the notations used in the sequel. Without loss of generality, assume that the first $n$ nodes, $\{1,2,\ldots,n\}$ , are inliers.

•

$p^{+}:=\max_{1\leq a\leq r}B_{aa},$ and $~{}p^{-}:=\min_{1\leq a\leq r}B_{aa}$

•

$q^{+}:=\max_{1\leq a<b\leq r}B_{ab},$ and $~{}q^{-}:=\min_{1\leq a<b\leq r}B_{ab}$

•

$\theta_{\min}:=\min_{1\leq i\leq n}\theta_{i}$ .

•

$G_{a}:=\sum\limits_{i\in C_{a}^{*}}\theta_{i}$ , $H_{a}:=\sum\limits_{1\leq b\leq r}B_{ab}G_{b}$ , $H^{+}:=\max\limits_{1\leq a\leq r}H_{a}$ , and $H^{-}:=\min\limits_{1\leq a\leq r}H_{a}$ , as defined previously.

•

$\bar{\theta}:=\sum\limits_{i=1}^{n}\theta_{i}/n$ , $\bar{\theta}_{a}:=\frac{G_{a}}{l_{a}}$ , $\bar{\theta}_{\min}=\min\limits_{1\leq a\leq r}\bar{\theta}_{a}$ , and $G_{\min}=\min_{1\leq a\leq r}G_{a}.$

•

$f_{i}:=\theta_{i}H_{a}$ for $i\in C_{a}^{*}$, which is the expected number of edges between the $i$-th vertex and the inliers.

•

$\widetilde{d}_{a}:=\sum\limits_{i\in C_{a}^{*}}d_{i}$ , which is the sum of the degrees of nodes in cluster $C_{a}^{*}$ .

•

For a matrix $\bm{M}$ and each pair $1\leq a,b\leq r$, we use $\bm{M}_{(a,b)}\in\mathbb{R}^{l_{a}\times l_{b}}$ to denote the submatrix of $\bm{M}$ with entries indexed by $C_{a}^{*}\times C_{b}^{*}$.

By definition, it is clear that

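$$G_{a}\geq l_{\min}\bar{\theta}_{\min}\quad\text{and}\quad n\bar{\theta}q^{-}\leq H_{a}\leq n\bar{\theta}p^{+}$$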

for all $1\leq a\leq r.$ Note that the expected degrees of the inliers are determined by the quantities $\theta_{i}$ and $H_{a}$; in fact, for $i\in C_{a}^{*}$ we have $\mathbb{E}\sum_{1\leq j\leq n}A_{ij}=f_{i}:=\theta_{i}H_{a}$.
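The quantities $G_{a}$, $H_{a}$ and $f_{i}$ are simple to compute once $\bm{\theta}$, $\bm{B}$ and the cluster labels are given. The sketch below does so for a toy instance and checks the bounds $n\bar{\theta}q^{-}\leq H_{a}\leq n\bar{\theta}p^{+}$; the function name and all numerical values are illustrative assumptions.

```python
def aggregate_parameters(labels, theta, B):
    """Return (G, H, f) with G_a = sum of theta over cluster a,
    H_a = sum_b B[a][b] * G_b, and f_i = theta_i * H_a for i in cluster a."""
    r = len(B)
    G = [0.0] * r
    for a, t in zip(labels, theta):
        G[a] += t
    H = [sum(B[a][b] * G[b] for b in range(r)) for a in range(r)]
    f = [t * H[a] for a, t in zip(labels, theta)]
    return G, H, f

labels = [0, 0, 1, 1, 1]
theta = [0.5, 1.5, 1.0, 1.0, 1.0]
B = [[0.6, 0.2], [0.2, 0.6]]
G, H, f = aggregate_parameters(labels, theta, B)

n = len(theta)
theta_bar = sum(theta) / n
p_plus, q_minus = 0.6, 0.2  # largest diagonal and smallest off-diagonal entry of B
bounds_hold = all(n * theta_bar * q_minus <= h <= n * theta_bar * p_plus for h in H)
```

Here $G=(2,3)$ and $H\approx(1.8,2.2)$, both of which lie between $n\bar{\theta}q^{-}=1$ and $n\bar{\theta}p^{+}=3$.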

4.2 Guarantee for Perfect Clustering

We are now ready to state the main result of the paper. Recall that our goal is to find a partition matrix of the form (2.2) given the adjacency matrix $\bm{A}$ , that is, to recover the cluster structure of the inliers from a single realization of a graph generated from DCSBM with outliers. The theorem below, proved in Section 6, provides sufficient conditions for when our convex relaxation approach (3.4) achieves this goal.

Theorem 1.

Assume that $p^{+}\asymp p^{-}\asymp q^{+}\asymp q^{-}$ and $\bar{\theta}\asymp\bar{\theta}_{\min}$ . Suppose that $q^{-}\geq\frac{m}{l_{\min}}$ and

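$$\delta\geq c_{0}\left\{\frac{p^{+}\log n}{\theta_{\min}G_{\min}}+\frac{\alpha n\bar{\theta}p^{+}}{G_{\min}}+\frac{\theta_{\max}p^{+}n\log n}{G_{\min}\theta_{\min}}+\frac{\log n}{G_{\min}\theta_{\min}}+\frac{mr}{\theta_{\min}G_{\min}}+\frac{m}{\alpha\theta_{\min}G_{\min}}\right\}\tag{4.1}$$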

for some $\delta>0$ , and that the tuning parameters in (3.4) satisfy

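$$\max_{1\leq a<b\leq r}\frac{B_{ab}+\delta}{H_{a}H_{b}}<\lambda<\min_{1\leq a\leq r}\frac{B_{aa}-\delta}{H_{a}^{2}}\tag{4.2}$$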

and

$$\alpha\geq c_{1}\frac{m}{H^{-}},\tag{4.3}$$

where $c_{0},c_{1}>0$ are sufficiently large numerical constants. Then with probability at least $1-\frac{1}{n}-\frac{2r}{n^{2}}-\frac{cr}{l^{4}_{\min}}$ for some constant $c$ , any solution $\bm{\widehat{X}}$ to the semidefinite program (3.4) must be of the form

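$$\bm{\widehat{X}}=\bm{P}\begin{bmatrix}\bm{J}_{l_{1}}&&&\bm{Z}_{1}\\ &\ddots&&\vdots\\ &&\bm{J}_{l_{r}}&\bm{Z}_{r}\\ \bm{Z}_{1}^{T}&\cdots&\bm{Z}_{r}^{T}&\bm{W}\end{bmatrix}\bm{P}^{T},$$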

where $\bm{P}$ is a permutation matrix.
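To see how the tuning conditions of the theorem could be evaluated in practice, the sketch below computes the admissible interval for $\lambda$, namely $\max_{a<b}(B_{ab}+\delta)/(H_{a}H_{b})<\lambda<\min_{a}(B_{aa}-\delta)/H_{a}^{2}$, together with the lower bound $c_{1}m/H^{-}$ on $\alpha$. The theorem only states that $c_{1}$ is a sufficiently large constant, so the default `c1=1.0` is a placeholder, and the function name and inputs are hypothetical.

```python
def tuning_ranges(B, H, delta, m, c1=1.0):
    """Return (lam_low, lam_high, alpha_min) implied by the tuning conditions."""
    r = len(B)
    lam_low = max(((B[a][b] + delta) / (H[a] * H[b])
                   for a in range(r) for b in range(a + 1, r)), default=0.0)
    lam_high = min((B[a][a] - delta) / H[a] ** 2 for a in range(r))
    alpha_min = c1 * m / min(H)  # min(H) plays the role of H^-
    return lam_low, lam_high, alpha_min

# Two clusters of 50 nodes with theta_i = 1: G_a = 50 and H_a = 0.6*50 + 0.2*50 = 40.
B = [[0.6, 0.2], [0.2, 0.6]]
H = [40.0, 40.0]
lam_low, lam_high, alpha_min = tuning_ranges(B, H, delta=0.1, m=10)
```

A valid $\lambda$ exists only when `lam_low < lam_high`; in this toy instance the interval is non-empty for $\delta=0.1$ but empty once $\delta$ exceeds half the gap $p-q$.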

Theorem 1 guarantees that any optimal solution $\bm{\widehat{X}}$ satisfies the property that for any inliers $i$ and $j$, $\widehat{X}_{ij}=1$ if nodes $i$ and $j$ are in the same true cluster and $\widehat{X}_{ij}=0$ otherwise. In other words, $\bm{\widehat{X}}$ correctly recovers the true cluster structure of the inliers. Since we impose no assumption on the outliers, there is in general no hope of determining how outliers would be clustered. Consequently, the theorem does not provide guarantees on the values of the elements in the last $m$ rows and $m$ columns of $\bm{\widehat{X}}$. Nevertheless, the theorem ensures that the presence of the outliers does not hinder the clustering of the inliers.

Once we obtain the solution $\bm{\widehat{X}}$ as above, we can extract from it an explicit clustering of the inliers by treating each row of $\bm{\widehat{X}}$ as a point in $\mathbb{R}^{n+m}$ and running the $k$ -means algorithm; see Cai and Li (2015); Chen et al. (2018) for the details.
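When $\bm{\widehat{X}}$ has the exact block form guaranteed by Theorem 1, the inlier clusters can also be read off directly. The sketch below is a simplified stand-in for the $k$-means step, not the procedure of the cited papers: it links inliers $i$ and $j$ whenever $\widehat{X}_{ij}$ exceeds a threshold and labels the resulting groups. It assumes the inliers occupy the first $n$ indices, and the function name and the threshold of 0.5 are hypothetical choices.

```python
def clusters_from_solution(X_hat, n, threshold=0.5):
    """Label the first n nodes by linking i and j when X_hat[i][j] > threshold."""
    label = [-1] * n
    next_label = 0
    for start in range(n):
        if label[start] != -1:
            continue
        label[start] = next_label
        stack = [start]
        while stack:  # depth-first search over thresholded entries
            i = stack.pop()
            for j in range(n):
                if label[j] == -1 and X_hat[i][j] > threshold:
                    label[j] = next_label
                    stack.append(j)
        next_label += 1
    return label

# Four inliers in two clusters plus one outlier with arbitrary entries (toy data).
X_hat = [
    [1, 1, 0, 0, 1],
    [1, 1, 0, 0, 0],
    [0, 0, 1, 1, 0],
    [0, 0, 1, 1, 1],
    [1, 0, 0, 1, 1],
]
print(clusters_from_solution(X_hat, n=4))  # [0, 0, 1, 1]
```

Unlike $k$-means, this shortcut relies on exact recovery and on knowing which nodes are inliers, so it is only meant to illustrate the structure of the solution.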

The results in Theorem 1 are non-asymptotic and valid for finite $n$; in particular, the probability of recovery has the form $1-O(\frac{1}{n})$, which is the same as in Cai and Li (2015, Theorem 3.1) and Chen et al. (2018, Theorem 3.3). Let us parse the recovery condition in Theorem 1 under the simplified setting with $p^{+}=p^{-}=p$, $q^{+}=q^{-}=q$, and $l_{a}=l_{\min},\forall 1\leq a\leq r$; that is, the connectivity matrix $\bm{B}$ has diagonal entries all equal to $p$ and off-diagonal entries all equal to $q$, and all clusters have the same size $l_{\min}$.

•

First consider the special case where the node degrees are uniform (no degree heterogeneity); that is, $\theta_{i}=1,\forall 1\leq i\leq n.$ In this case, noting that $p\asymp q$ by assumption and performing some algebra, we find that the conditions (4.1)–(4.3) simplify to

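$$\delta\geq c_{0}\left\{\frac{p\log n}{l_{\min}}+\frac{\alpha np}{l_{\min}}+\frac{nq\log n}{l_{\min}}+\frac{mr}{l_{\min}}+\frac{m}{\alpha l_{\min}}\right\},\qquad\frac{q+\delta}{f^{2}}<\lambda<\frac{p-\delta}{f^{2}},\qquad\alpha\geq c_{1}\frac{m}{f},$$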

where $f:=qn+(p-q)l_{\min}$ is the expected inlier degree. Up to a rescaling by $f$ , these conditions match those in Cai and Li (2015, Theorem 3.1) under the same setting.

•

Next consider the special case where there are no outliers; that is, $m=0$. In this case, we may take $\alpha=0$; moreover, by again noting that $p\asymp q$ and performing some algebra, we find that the conditions (4.1)–(4.2) become

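$$\delta\geq c_{0}\left\{\frac{p\log n}{\theta_{\min}G_{\min}}+\frac{qn\log n}{G_{\min}}\cdot\frac{\theta_{\max}}{\theta_{\min}}\right\},\qquad\max_{1\leq a<b\leq r}\frac{q+\delta}{H_{a}H_{b}}<\lambda<\min_{1\leq a\leq r}\frac{p-\delta}{H_{a}^{2}}.$$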

These conditions match those in Chen et al. (2018, Theorem 3.3) except for an additional factor $\frac{\theta_{\max}}{\theta_{\min}}$ in the gap condition for $\delta$.
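For the first special case above (uniform degrees), the algebra behind the expected inlier degree $f$ can be spelled out as follows. This is a worked sketch using only the definitions of $G_{a}$ and $H_{a}$ and the simplified setting with $r$ clusters of equal size $l_{\min}$, so that $n=rl_{\min}$.

```latex
% Uniform degrees (\theta_i = 1) and equal cluster sizes l_a = l_{\min}, so n = r\, l_{\min}.
\begin{align*}
G_a &= \sum_{i \in C_a^*} \theta_i = l_{\min}, \\
H_a &= \sum_{b=1}^{r} B_{ab} G_b
     = p\, l_{\min} + q\,(r-1)\, l_{\min}
     = qn + (p-q)\, l_{\min} = f .
\end{align*}
% Substituting H_a = H^- = f into the tuning conditions of Theorem 1:
\[
\frac{q+\delta}{f^{2}} < \lambda < \frac{p-\delta}{f^{2}},
\qquad
\alpha \ge c_1 \frac{m}{f}.
\]
```

Since every $H_{a}$ equals $f$ in this setting, the maximum and minimum over clusters in the condition on $\lambda$ collapse to single ratios.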

Therefore, in the special cases of SBM with outliers and DCSBM, we see that Theorem 1 is strong enough to essentially recover the results in Cai and Li (2015); Chen et al. (2018) as corollaries. Moreover, Theorem 1 strictly generalizes their results as it is applicable in the setting with both outliers and degree heterogeneity.

5 Experiments

In this section, we provide numerical results that demonstrate the performance of our algorithm for clustering heterogeneous networks with outliers. We also compare our algorithm with several state-of-the-art algorithms.

Recall the structure of the adjacency matrix $\bm{A}$ as given in equation (2.1), which we reproduce below

$$\bm{A}=\bm{P}\begin{bmatrix}\bm{K}&\bm{Z}\\ \bm{Z}^{T}&\bm{W}\end{bmatrix}\bm{P}^{T}.$$

With this in mind, we now describe how we generate the inlier part $\bm{K}$ and the outlier part $(\bm{Z},\bm{W})$ of the adjacency matrix.

Inliers:

For each inlier node $i\in[n]$ , the degree heterogeneity parameter $\theta_{i}$ is sampled independently from a Pareto( $\alpha$ , $\beta$ ) distribution with the density function $f(x|\alpha,\beta)=\frac{\alpha\beta^{\alpha}}{x^{\alpha+1}}\mathbf{1}_{\{x\geq\beta\}}$ , where $\alpha$ and $\beta$ are called the shape and scale parameters, respectively. We consider different values of the shape parameter, and choose the scale parameter accordingly so that the expectation of each $\theta_{i}$ is fixed at $1$ . Note that the heterogeneity of the degree $\theta_{i}$ ’s decreases as the shape parameter $\alpha$ increases. Given the above $\bm{\theta}$ and two given inter and intra-cluster density parameters $0<q<p<1$ , we then generate $\bm{K}$ according to DCSBM with parameters $p$ , $q$ and the $\bm{\theta}$ .
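The choice of the scale parameter can be made explicit: a Pareto($\alpha$, $\beta$) variable has mean $\alpha\beta/(\alpha-1)$ for $\alpha>1$, so fixing the mean at 1 gives $\beta=(\alpha-1)/\alpha$. The sketch below samples $\bm{\theta}$ this way using Python's standard library; the function name `sample_theta` and the shape value in the example are illustrative assumptions, not the settings used in the experiments.

```python
import random

def sample_theta(n, shape, seed=0):
    """Draw n Pareto(shape, scale) values with the scale chosen so the mean is 1."""
    if shape <= 1:
        raise ValueError("the Pareto mean is finite only for shape > 1")
    scale = (shape - 1.0) / shape  # solves shape * scale / (shape - 1) = 1
    rng = random.Random(seed)
    # random.paretovariate(shape) draws from a Pareto distribution with scale 1.
    return [scale * rng.paretovariate(shape) for _ in range(n)]

theta = sample_theta(100_000, shape=5.0)
sample_mean = sum(theta) / len(theta)  # close to 1 for a large sample
```

Smaller shape values put more mass in the tail, which is the sense in which the heterogeneity of the $\theta_{i}$'s grows as the shape parameter decreases.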

Outliers:

To generate the outliers we follow Cai and Li (2015, p. 7). Let $\tau\in[0,1]$ be a fixed number. We assume that for each $i\in[n]$ and $j\in[m]$, $Z_{ij}\sim\text{Bernoulli}(\rho_{i}\tau)$ and $\sqrt{\rho_{i}}\sim\text{Uniform}(0,1)$. We also assume that for each $1\leq i<j\leq m$, $W_{ij}\sim\text{Bernoulli}(0.7\tau)$. Here $\tau$ controls the degrees of the outliers.
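A minimal sketch of this outlier model is given below. The function names are hypothetical, and the zero diagonal of $\bm{W}$ is an assumption (the text only specifies entries with $i<j$). Since $\sqrt{\rho_{i}}$ is uniform on $(0,1)$, we have $\mathbb{E}[\rho_{i}]=1/3$, which gives the expected outlier degree computed by the second helper.

```python
import random


def sample_outlier_blocks(n, m, tau, rng):
    """Return (Z, W): Z is n x m with Z_ij ~ Bernoulli(rho_i * tau);
    W is a symmetric m x m 0/1 matrix with W_ij ~ Bernoulli(0.7 * tau) for i < j."""
    # sqrt(rho_i) ~ Uniform(0, 1), hence rho_i is the square of a uniform draw.
    rho = [rng.random() ** 2 for _ in range(n)]
    Z = [[1 if rng.random() < rho[i] * tau else 0 for _ in range(m)]
         for i in range(n)]
    W = [[0] * m for _ in range(m)]  # assumption: zero diagonal
    for i in range(m):
        for j in range(i + 1, m):
            if rng.random() < 0.7 * tau:
                W[i][j] = W[j][i] = 1
    return Z, W


def expected_outlier_degree(n, m, tau):
    """E[rho_i] = 1/3, so an outlier expects n * tau / 3 edges to inliers
    plus 0.7 * tau * (m - 1) edges to the other outliers."""
    return n * tau / 3.0 + 0.7 * tau * (m - 1)
```

For instance, with $n=300$, $m=11$ and $\tau=1$ the helper returns $100+7=107$, which can be compared against the average inlier degree when choosing $\tau$.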

In the following experiments, we choose the parameter $\tau$ such that the outliers’ expected degree is moderately above the average of the inliers’ degrees. Given that the inliers’ degrees are heavy-tailed, this means that the outliers’ degrees are not distinguishable from those of high-degree inliers. The larger the parameter $\tau$, the harder the recovery problem.

In Figure 1 we show the performance of our algorithm in terms of the misclassification rate. Here we consider varying values for the shape parameter, the number of outliers and the intra-cluster density parameter $p$. The inter-cluster density parameter is $q=p/3$. As can be seen from the figure, as the problem gets harder in terms of more heterogeneity, more outliers or more sparsity, the performance of our algorithm degrades gracefully. For $p$ as low as $20\%$, the performance suffers only very little as the degree distribution gets significantly heavier (as captured by the shape parameter $\alpha$) and as we increase the number of outliers. Very sparse graphs (with intra-cluster connectivity $p=8\%$) are naturally more sensitive.
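The text does not spell out how the misclassification rate is computed. One common convention, shown here purely as an assumption, is the fraction of inlier nodes whose predicted label disagrees with the ground truth, minimized over all relabelings of the $r$ clusters. The function name and the label encoding ($0,\dots,r-1$) are hypothetical.

```python
from itertools import permutations


def misclassification_rate(true_labels, pred_labels, r):
    """Fraction of mislabeled nodes, minimized over all permutations of the
    r cluster labels (labels are assumed to be integers in 0..r-1)."""
    n = len(true_labels)
    best = n
    for perm in permutations(range(r)):
        # perm[p] is the ground-truth cluster that predicted cluster p is mapped to.
        errors = sum(1 for t, p in zip(true_labels, pred_labels) if perm[p] != t)
        best = min(best, errors)
    return best / n
```

The exhaustive search over permutations is only practical for a small number of clusters; for larger $r$ an assignment solver would be the usual replacement.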

In Figure 2, we consider a setting similar to Figure 1, but with larger graphs $n=1000$ . The results demonstrate the same relatively unhindered performance under increased heterogeneity and number of outliers, when the graph is not too sparse.

We next decrease the connectivity of the outliers, as we set $\tau=0.5$ . In this case, the problem becomes easier, as outliers are more restricted. As shown in Figure 3, the misclassification rates decrease and remain small even as we increase the number of outliers and the heterogeneity of the inliers.

Finally, in Figure 4, we compare our algorithm with three state-of-the-art algorithms: spectral clustering (Zhang et al., 2014), SCORE (Jin, 2015) and Cai-Li (Cai and Li, 2015). The gain in performance is significant, and in particular for the more adversarial settings with high degree.

6 Proof of Theorem 1

In this section, we prove our main result in Theorem 1.

6.1 Roadmap of the Proof

The high level strategy of the proof involves using a primal-dual witness approach, which consists of two steps:

1. We first construct a candidate optimal primal solution to the convex program (3.4). This is done by solving an auxiliary optimization problem; see Lemma 1.

2. We then certify that this candidate solution is indeed optimal by showing that it satisfies a form of the first-order optimality (KKT) condition, which involves the existence of a corresponding dual solution/certificate. This is done by explicitly constructing the dual certificate and proving that it has the desired properties with high probability. A crucial step in the analysis is to decompose the penalized connecting matrix $\alpha{\rm diag}\left(\bm{d^{*}}\right)+\lambda\bm{dd^{T}}-\bm{A}$ into four terms and establish high-probability bounds for each of them.

The reason for using the above strategy is as follows. Our goal is to recover the true inlier clusters, so the “inlier part” of the desired solution should have a block-diagonal form that corresponds to the ground truth clusters, as in equation (2.2). However, a priori we do not know what the “outlier part” of the solution will look like: it depends on the edge connections of the outliers, and in general will not be exactly zero. Therefore, we need to first “pin down” the outlier part of the solution, which is precisely Step 1 above. To show this solution is indeed optimal, we prove that there exists a corresponding dual solution that “certifies” its optimality, which is the goal of Step 2 above. Below we elaborate on the main technical challenges and novelty in these two steps.

In Step 1, we construct a candidate solution $\bm{X^{\star}}=\bm{V^{*}V^{*^{T}}}$ that is feasible to the primal problem. A major difficulty in proving the optimality of $\bm{X^{\star}}$ is that a priori we do not know the exact value of the matrix $\bm{X^{\star}}$. To overcome this difficulty, we note that the candidate solution $\bm{X^{\star}}$ is constructed from the optimal solution of the auxiliary optimization problem. The KKT condition of the auxiliary optimization problem gives several desirable constraints for the outlier parts of its primal and dual solutions (i.e., the constraints on $\beta$ and $\bm{x_{a}}$ in Lemma 1); in particular, the solution $\sum_{a=1}^{r}\bm{x_{a}}\bm{x_{a}}^{T}$ must be perpendicular to the normal vector of the semidefinite cone constraint. We show that this property is equivalent to $\bm{\Lambda V^{*}}=\bm{0}$, where $\bm{\Lambda}$ is the outlier part of the matrix $\bm{E}=\alpha{\rm diag}\left\{\bm{d^{*}}\right\}+\lambda\bm{d}\bm{d^{T}}-\bm{A}$ that appears in the objective of our convex relaxation approach (3.4); cf. (6.16). This property allows us to understand the effect of the outlier part of the solution $\bm{X^{\star}}$ and subsequently find the closed form of the other parts.

In Step 2, to establish the optimality of $\bm{X^{\star}}$ , we need to show that it has an objective value no larger than that of any other feasible solution $\bm{X}$ . In other words, we need to show that $\Delta\left(\bm{X}\right)\triangleq\langle\bm{X}^{*}-\bm{X},\bm{E}\rangle\leq 0$ . To this end, we make use of the property of the matrix $\bm{E}$ , which can be decomposed as in (6.16) into the block-diagonal part (i.e., the term within inlier clusters $\bm{\Psi}$ ), the off-diagonal part (i.e., the term between inlier clusters $\bm{\Phi}$ ), the outlier-inlier part (i.e., the term between inliers and outliers $\bm{\Gamma}$ ), and the within-outlier part (i.e., the term within outlier set $\bm{\Lambda}$ ). For example, the element in the block-diagonal part is the sum of some inliers’ degree terms and a Bernoulli random variable with a relatively large parameter, while the element in the outlier-inlier part is the sum of some outliers’ degree terms. As mentioned, in Step 1 we establish several structural properties of $\bm{X^{\star}}$ . Combining these properties of $\bm{X^{\star}}$ and those of $\bm{E}$ , we can apply probability concentration inequalities to separately bound the four terms $\langle\bm{X}^{*}-\bm{X},\bm{\Psi}\rangle$ , $\langle\bm{X}^{*}-\bm{X},\bm{\Phi}\rangle$ , $\langle\bm{X}^{*}-\bm{X},\bm{\Gamma}\rangle$ and $\langle\bm{X}^{*}-\bm{X},\bm{\Lambda}\rangle$ that contribute to $\Delta\left(\bm{X}\right)$ .

The most challenging point lies in proving that the matrix $\bm{\Lambda}$ corresponding to the outliers is positive semidefinite. To achieve this, we need to choose the tuning parameter $\alpha$ appropriately, and relate the matrix $\bm{\Lambda}$ to another matrix $\bm{\widetilde{\Lambda}}$, which excludes the “between inliers” matrix $\bm{\Phi}$. Then the problem becomes proving that $\bm{\widetilde{\Lambda}}$ is a positive semidefinite matrix. We again separate $\bm{\widetilde{\Lambda}}$ into the inlier part and the outlier part and prove positive semidefiniteness via the Gershgorin Theorem (Horn and Johnson, 2012), namely by showing that each diagonal entry is larger than the sum of the absolute values of the off-diagonal entries in the same row. Another difficulty is that we need to adjust the parameter $d_{i}$ so that it has appropriate lower and upper bounds when we apply the Gershgorin Theorem. This is the technical reason why we use $d_{i}^{*}$ instead of $d_{i}$ in the diagonal penalization term in (3.4).

Before proceeding with the proof, we note several useful facts. The condition (4.1) in Theorem 1 implies that $\theta_{i}H_{a}\geq m$, i.e., an inlier’s expected number of connections to other inliers is larger than the number of outliers. Moreover, we also have the following upper bound on the maximum degree of an inlier: with probability at least $1-\frac{1}{n^{2}}$,

[TABLE]

This bound can be proved using Chernoff’s inequality, which ensures that $d_{i}\leq\left(1+\frac{\delta}{5B_{aa}}\right)f_{i}+m\leq 2(\theta_{i}H_{a}+m)\leq 4\theta_{i}H_{a}$ with probability at least $1-\frac{1}{n^{3}}$. Finally, we have the relationship $\bar{\theta}\geq\bar{\theta}_{\min}>C_{0}>0$, which follows from the definitions of these quantities and the condition (4.1).

6.2 Step 1: Solution Candidate

In this section, we construct a candidate solution $\bm{X}$ feasible to our convex relaxation (3.4). Define the matrices

[TABLE]

Consequently, we have the expression

[TABLE]

Since the desired candidate solution of optimization problem (3.4) has a block-diagonal structure in the inlier part, the cost of inlier part is fixed. We therefore focus on minimizing the cost of the outlier part. The objective function of the optimization problem (6.5) is actually the $n+1,n+2,\cdots,n+m$ rows and columns of the objective function of (3.4). The following lemma, proved in Appendix A, guarantees the existence of $r$ vectors $\bm{x}_{1},\cdots,\bm{x}_{r}\in\mathbb{R}^{m}$ . These vectors are used to construct a candidate solution.

Lemma 1.

If assumptions (4.2) and (4.3) hold, then the solution to

[TABLE]

exists and is unique. Moreover, denote the solutions by $\bm{x}_{1},\cdots,\bm{x}_{r}\in\mathbb{R}^{m}$ , which by definition satisfy $\left\|{x_{a}}\right\|_{\infty}\leq 1$ . Then there are nonnegative vectors $\bm{\beta}_{1},\cdots,\bm{\beta}_{r}\in\mathbb{R}^{m}$ and an $m\times m$ nonnegative diagonal matrix

[TABLE]

such that

[TABLE]

In addition, we have

[TABLE]

Furthermore, for all $a=1,\cdots,r$ and $j=1,\cdots,m$ , we have

[TABLE]

Finally, for all $a=1,\cdots,r$ , we have

[TABLE]

To proceed, we define the matrices

[TABLE]

and

[TABLE]

Since the $\bm{x_{a}}$’s are feasible to optimization problem (6.5), we can easily see that $\bm{X}^{*}$ is feasible to optimization problem (3.4). In the sequel, we will prove that $\bm{X}^{*}$ is actually an optimal solution to (3.4).

6.3 Step 2: Verification of the solution to the dual problem

To establish the theorem, it suffices to show that for any feasible solution $\bm{X}$ to the program (3.4) with $\bm{X}\neq\bm{X}^{*}$, we have

[TABLE]

To this end, we will prove that $\Delta\left(\bm{X}\right)$ can be decomposed as

[TABLE]

where the matrices $\bm{\Psi}$ , $\bm{\Phi}$ and $\bm{\Gamma}$ have the form

[TABLE]

and the matrix $\bm{\Lambda}=\alpha{\rm diag}\left\{\bm{d^{*}}\right\}+\lambda\bm{d}\bm{d^{T}}-\bm{A}-(\bm{\Psi}+\bm{\Phi}+\bm{\Gamma})$ satisfies $\bm{\Lambda}\bm{V^{\star}}=\bm{0}$ .

In the following, we will construct one by one the matrices $\bm{\Lambda},\bm{\Psi}$ and $\bm{\Phi}$ in the decomposition (6.16) and prove that $\bm{\Psi}_{aa}>0$ , $\bm{\Phi}_{ab}>0$ and $\bm{\Lambda}\succeq\bm{0}$ . Finally, we will prove that $S_{1}<0$ and $S_{i}\leq 0$ for $i=2,3,4$ , from which we can conclude that $\Delta\left(\bm{X}\right)<0$ and thereby finish the proof.

6.3.1 Construction of $\Psi_{aa}$ and $\Phi_{ab}$ in (6.16)

The equality $\bm{\Lambda}\bm{V^{*}}=\bm{0}$ yields that

[TABLE]

It is clear that (6.17) is equivalent to (6.7). In the following, we will construct $\bm{\Psi}_{aa}$ satisfying (6.18) and $\bm{\Phi}_{ab}$ satisfying both (6.19) and (6.20).

The equality (6.18) is equivalent to

[TABLE]

where the last equality is due to $\langle\bm{x}_{a},\bm{\beta}_{a}\rangle=0$ . To ensure $\bm{\Psi}_{aa}>0$ , we construct $\bm{\Psi}_{aa}$ as the sum of a non-negative diagonal matrix plus a positive matrix. In particular, we set

[TABLE]

Setting $\epsilon=\frac{\delta}{10}$ satisfies our requirements.

Next let us construct $\bm{\Phi}_{ab}\in\mathbb{R}^{l_{a}\times l_{b}}$ satisfying both (6.19) and (6.20). These two equalities are equivalent to

[TABLE]

One can verify that

[TABLE]

If we set

[TABLE]

then $\bm{\Phi_{ab}}$ satisfies the two equalities displayed above, and hence (6.19) and (6.20). After simplification, we obtain

[TABLE]

As we have shown, $\bm{\Psi}$ and $\bm{\Phi}$ are well defined, so $\bm{\Lambda}$ is given by $\bm{\Lambda}=\bm{E-\Psi-\Phi-\Gamma}$ . In the following, we will study the properties of these matrices and give lower bounds for terms $S_{1},S_{2},S_{3}$ and $S_{4}$ defined in (6.16).

6.3.2 The $S_{1}$ Term in (6.16)

We will show that

[TABLE]

Notice that $\bm{\Psi_{aa}}-\epsilon\bm{\theta_{(a)}}\bm{\theta_{(a)}}^{T}$ is a diagonal matrix, so we only need to check that each entry on the diagonal is greater than or equal to 0. Since $\bm{Z_{a}}\geq\bm{0}$, $\bm{x_{a}}\geq\bm{0}$, and $\left\|{\bm{x_{a}}}\right\|_{\infty}\leq 1$, it is sufficient to prove

[TABLE]

Notice that

[TABLE]

and

[TABLE]

By Lemma 3, we have with probability at least $1-\frac{1}{n^{2}}$

[TABLE]

and

[TABLE]

In addition, by Chernoff’s Inequality, with probability at least $1-\frac{1}{n^{3}}$ we have

[TABLE]

Thus, with probability $\geq 1-\frac{1}{n^{2}}$ , we have

[TABLE]

where the last inequality is due to the fact that $\theta_{i}H_{a}\geq m$.

Combining pieces, we see that the following is sufficient for our goal:

[TABLE]

Note that the condition (4.1) in Theorem 1 fulfills all the requirements above, thus the diagonal matrix $\bm{\Psi_{aa}}-\epsilon\bm{\theta_{(a)}}\bm{\theta_{(a)}^{T}}$ is both entrywise nonnegative ($\geq\bm{0}$) and positive semidefinite ($\succeq\bm{0}$). This implies the weaker result that $\bm{\Psi_{aa}}>\bm{0}$.

Finally, we have

[TABLE]

where the last inequality is due to the fact that all entries of $\bm{X^{*}}_{(a,a)}$ equal 1 and all entries of $\bm{X}_{(a,a)}$ are no larger than 1.

6.3.3 The $S_{2}$ Term in (6.16)

We will first prove that $\bm{\Phi_{ab}}>\bm{0}$. For the first three terms in (6.27), we apply Lemma 4 to get

[TABLE]

Lemma 1 also proves that $\bm{x_{a}^{T}\left(\widetilde{W}+\Xi\right)x_{b}}\leq m\sqrt{l_{a}l_{b}}$ .

To bound the fourth and fifth terms in (6.27), we first bound $\bm{\widetilde{Z_{a}}x_{b}}=\left(\lambda\bm{d_{(a)}d_{(r+1)}^{T}}-\bm{Z_{a}}\right)\bm{x_{b}}$. Since $\bm{x_{b}}\geq 0$, $\left\|{\bm{x_{b}}}\right\|_{\infty}\leq 1$ and $\bm{Z_{a}}$ is a 0-1 matrix, we have

[TABLE]

where the upper bound on $\bm{\widetilde{Z_{a}}x_{b}}$ is due to the facts that $f_{i}=\theta_{i}H_{a}+m\leq 2\theta_{i}H_{a}$ and $\left(B_{aa}-\delta\right)\left(1+\frac{\delta}{5B_{aa}}\right)\leq B_{aa}-\frac{4}{5}\delta$.

Therefore, to prove $\bm{\Phi_{ab}}>0$ , we only need to prove that

[TABLE]

which is implied by

[TABLE]

Note that the condition (4.1) in Theorem 1 fulfills all the requirements above, thus we have $\bm{\Phi_{ab}}>0$ .

Finally, we have

[TABLE]

where the last inequality is due to the fact that $\bm{X^{*}}_{(a,b)}=\bm{0}$ and $\bm{X}_{(a,b)}\geq\bm{0}$ .

6.3.4 The $S_{3}$ Term in (6.16)

By the feasibility of $\bm{X}$ and the non-negativity of $\bm{\beta}_{a}$ and $\bm{\Xi}$ , we have

[TABLE]

and

[TABLE]

By (6.9), i.e. $\langle\bm{x}_{a},\bm{\beta}_{a}\rangle=0$ , we have

[TABLE]

By (6.8), we have

[TABLE]

It follows that

[TABLE]

where the last inequality is due to the fact that all entries of $\bm{X_{(r+1,r+1)}}$ are no larger than 1.

6.3.5 The $S_{4}$ Term in (6.16)

We will first prove that $\bm{\Lambda}\succeq\bm{0}$. The condition $\bm{\Lambda V^{*}}=\bm{0}$ implies that $\operatorname{rank}\left(\bm{\Lambda}\right)\leq N-r$. Thus we only need to prove that the $(N-r)$-th largest eigenvalue of $\bm{\Lambda}$ is no smaller than $0$ while all other smaller eigenvalues are equal to $0$.

We define the matrix

[TABLE]

One sees that $\bm{\widehat{V}}$ is a basis matrix, i.e., the columns of $\bm{\widehat{V}}$ are orthogonal unit vectors. Take $\bm{\widehat{V}}_{\perp}\in\mathbb{R}^{N\times(N-r)}$ such that $\bm{U}=\left[\bm{\widehat{V}}_{\perp},\bm{\widehat{V}}\right]$ is an orthogonal matrix. Define the matrix

[TABLE]

The matrix $\bm{\widetilde{\Lambda}}$ is close to $\bm{\Lambda}$ in the sense that

[TABLE]

Note that each entry of $\bm{\widetilde{\Lambda}-\Lambda}$ takes the form of $C\bm{\theta_{(a)}\theta_{(b)}^{T}}$ , where $C$ is a constant. Thus we have $\bm{\widehat{V}}_{\perp}^{T}\left(\bm{\widetilde{\Lambda}-\Lambda}\right)\bm{\widehat{V}}_{\perp}=0$ , or $\bm{\widehat{V}}_{\perp}^{T}\bm{\widetilde{\Lambda}}\bm{\widehat{V}}_{\perp}=\bm{\widehat{V}}_{\perp}^{T}\bm{\Lambda}\bm{\widehat{V}}_{\perp}$ . Since the matrix

[TABLE]

has the same eigenvalues as $\bm{\Lambda}$ does, Weyl’s Inequality implies that

[TABLE]

Thus we only need to prove $\bm{\widetilde{\Lambda}}\succ\bm{0}$ .

To this end, we consider the decomposition

[TABLE]

In Section 6.3.1, we proved $\bm{\Psi_{aa}}\succeq\frac{\delta}{4}G_{a}{\rm diag}\left(\bm{\theta_{(a)}}\right)+\frac{\delta}{10}\bm{\theta_{(a)}\theta_{(a)}^{T}}$ . Combining with (6.87), we have

[TABLE]

Thus we have

[TABLE]

Combining with the bound (6.88) to be proved later, we have

[TABLE]

By taking

[TABLE]

when $C$ is large enough, we obtain that

[TABLE]

With the above bound, to prove $\bm{\widetilde{\Lambda}}\succ\bm{0}$ , it suffices to prove

[TABLE]

Set $w:=\frac{Cm}{G_{\min}\theta_{\min}\delta}$ with a sufficiently large constant $C$ . By multiplying both sides of (6.58) by $\begin{bmatrix}w\bm{I_{n}}&\bm{0}\\ \bm{0}&\bm{I_{m}}\end{bmatrix}$ , it suffices to prove

[TABLE]

The above inequality is true if we can prove that, in each row, the sum of the absolute values of all off-diagonal entries is less than the absolute value of the corresponding diagonal entry.

For the first $n$ rows, we have

[TABLE]

Since $\bm{0}\leq\bm{Z_{a}}\leq\bm{J_{(l_{a},m)}}$, and by Lemma 1, $\bm{0}\leq\bm{\beta_{a}}\leq\lambda\left(\widetilde{d}_{a}+\widetilde{d}_{r+1}\right)\bm{d_{(r+1)}}$, the sum of the absolute values of the entries in the $i$-th row of $w\bm{\widetilde{\Lambda_{2}}}$ is no larger than

[TABLE]

Therefore, we only need to prove

[TABLE]

Note that $H_{a}=\sum\limits_{b=1}^{r}B_{ab}G_{b}\geq n\bar{\theta}q^{-}$ . The inequality (6.62) is implied by the following four conditions:

[TABLE]

and

[TABLE]

and

[TABLE]

and

[TABLE]

The last inequality is due to the fact that $G_{a}\geq\theta_{\min}l_{\min}\geq\frac{\theta_{\min}m}{q^{-}}$ and $H_{a}\geq q^{-}n\bar{\theta}_{\min}$ where $\bar{\theta}$ is the average value of all $\theta_{i}$ ’s.

To study the bottom $m$ rows of $\bm{\widetilde{\widetilde{\bm{\Lambda}}}}$ , we notice that

[TABLE]

so the sum of the absolute values of the entries in the $j$-th row of $w\bm{\widetilde{\Lambda_{2}}^{T}}$ is no larger than

[TABLE]

On the other hand, the sum of absolute values of off-diagonal entries in the $j$ -th row of $\bm{\widetilde{W}+\Xi}=\alpha{\rm diag}\left(\bm{d^{*}_{(r+1)}}\right)+\lambda\bm{d_{(r+1)}}\bm{d_{(r+1)}^{T}}-\bm{W}+\bm{\Xi}$ is no larger than

[TABLE]

The $j$ -th diagonal entry of $\bm{\widetilde{W}+\Xi}$ is no smaller than

[TABLE]

Combining pieces, we see that it suffices to establish

[TABLE]

By requiring $\delta\geq C\frac{m\sqrt{r}}{\theta_{\min}G_{\min}}$ , we have $w\sqrt{r}=\frac{m\sqrt{r}}{\theta_{\min}G_{\min}\delta}<\frac{1}{2}$ for sufficiently large constant $C$ . Thus we only need to prove

[TABLE]

Notice that $d^{*}_{n+j}\geq\max\left\{d_{n+j},\max\limits_{a}H_{a}\right\}$, so the inequality (6.76) is implied by the conditions:

[TABLE]

and

[TABLE]

and

[TABLE]

Note that the condition (4.1) in Theorem 1 fulfills all the requirements above. We conclude that $\bm{\widetilde{\Lambda}}\succ\bm{0}$, and hence $\bm{\Lambda}\succeq\bm{0}$.

Finally, we have

[TABLE]

where the last inequality is due to the fact that $\bm{\Lambda V^{*}}=\bm{0}$ and that $\bm{X}$ and $\bm{\Lambda}$ are both positive semidefinite matrices.
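The two facts just cited combine as follows. This is a short worked sketch under the assumption that the $S_{4}$ term is $\langle\bm{X}^{*}-\bm{X},\bm{\Lambda}\rangle$, as described in the roadmap, with $\bm{X}^{*}=\bm{V^{*}}\bm{V^{*}}^{T}$:

```latex
\begin{aligned}
\langle \bm{X}^{*}-\bm{X},\bm{\Lambda}\rangle
&= \operatorname{tr}\left(\bm{V^{*}}^{T}\bm{\Lambda}\bm{V^{*}}\right)-\langle \bm{X},\bm{\Lambda}\rangle
 && \text{since } \bm{X}^{*}=\bm{V^{*}}\bm{V^{*}}^{T} \\
&= -\langle \bm{X},\bm{\Lambda}\rangle
 && \text{since } \bm{\Lambda}\bm{V^{*}}=\bm{0} \\
&= -\operatorname{tr}\left(\bm{\Lambda}^{1/2}\bm{X}\bm{\Lambda}^{1/2}\right)\leq 0
 && \text{since } \bm{X}\succeq\bm{0},\ \bm{\Lambda}\succeq\bm{0}.
\end{aligned}
```

The last line uses that $\bm{\Lambda}^{1/2}\bm{X}\bm{\Lambda}^{1/2}$ is positive semidefinite whenever $\bm{X}$ and $\bm{\Lambda}$ are, so its trace is nonnegative.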

6.4 Concluding the proof

In conclusion, we have proved that $S_{1}<0$ and $S_{2},S_{3},S_{4}\leq 0$ . Thus $\Delta(\bm{X})=\langle\bm{X}^{*}-\bm{X},\bm{E}\rangle=S_{1}+S_{2}+S_{3}+S_{4}<0$ and we have finished the proof of Theorem 1.

6.5 Technical Lemmas

Lemma 2 (Chernoff’s Inequality).

Let $X_{1},X_{2},\cdots,X_{n}$ be independent random variables with

[TABLE]

Then the sum $X=\sum_{i=1}^{n}X_{i}$ has expectation $\mathbb{E}(X)=\sum_{i=1}^{n}p_{i}$ and we have

[TABLE]

and

[TABLE]

Lemma 3.

If we define $f_{i}=\mathbb{E}\left[d_{i}\right]$ , then with probability at least $1-\frac{1}{n^{2}}$ , we have for all $i=1,2,\cdots,n$ ,

[TABLE]

Further, if we assume $\delta\geq C_{1}\sqrt{\frac{p^{+}\log n}{\theta_{\min}G_{\min}}}$ , where $p^{+}=\max\limits_{a}B_{aa}$ , we have

[TABLE]

Proof.

The inequalities (6.78) and (6.79) are straightforward consequences of Chernoff’s Inequality. These inequalities imply that $\left|d_{i}-f_{i}\right|\leq 2\log n+\sqrt{6f_{i}\log n}$. Since $f_{i}=\theta_{i}H_{a}\geq q^{-}\theta_{\min}G_{\min}$, it follows from the assumption of Theorem 1 that

[TABLE]

Therefore, as long as $C$ is large enough, we have $\sqrt{\log n/f_{i}}\leq 1$ . Thus $\frac{\log n}{f_{i}}\leq\sqrt{\frac{\log n}{f_{i}}}\leq 1$ and (6.80) follows immediately. ∎

Lemma 4.

With probability at least $1-\frac{2}{n}-\frac{2r}{n^{2}}$, the following inequalities hold for all $1\leq a<b\leq r$:

[TABLE]

Proof.

Proof of (6.82): The entry in the $i$-th row and $j$-th column of $\bm{K}$ follows the Bernoulli distribution with mean $\theta_{i}\theta_{j}B_{ab}$. Thus the sum of all entries of $\bm{K_{ab}}$ has a mean of $G_{a}G_{b}B_{ab}$. By Chernoff’s Inequality, we have

[TABLE]

Letting $t=\sqrt{6G_{a}G_{b}B_{ab}\log n}$, we have with probability at least $1-\frac{1}{n^{3}}$,

[TABLE]

Since $\delta\geq C\sqrt{\frac{p^{+}\log n}{\theta_{\min}G_{\min}}}\geq C\sqrt{\frac{B_{ab}\log n}{G_{a}G_{b}}}$ holds for a sufficiently large constant $C$, we have $\bm{1_{l_{a}}^{T}}\bm{K_{ab}}\bm{1_{l_{b}}}\geq\left(B_{ab}-\frac{1}{25}\delta\right)G_{a}G_{b}$, and therefore (6.82) holds.
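To make the role of the choice of $t$ concrete, the calculation can be written out as below. This is a sketch that assumes the lower-tail bound in Lemma 2 has the standard form $\mathbb{P}(X\leq\mathbb{E}X-t)\leq\exp\left(-\frac{t^{2}}{2\mathbb{E}X}\right)$; the symbol $\mu$ is introduced here only for brevity.

```latex
\begin{aligned}
\mu &:= G_{a}G_{b}B_{ab}, \qquad t=\sqrt{6\mu\log n},\\
\mathbb{P}\left(X\leq\mu-t\right) &\leq \exp\left(-\frac{t^{2}}{2\mu}\right)=\exp(-3\log n)=\frac{1}{n^{3}},\\
\mu-t &= G_{a}G_{b}\left(B_{ab}-\sqrt{\frac{6B_{ab}\log n}{G_{a}G_{b}}}\right)
\geq G_{a}G_{b}\left(B_{ab}-\frac{\delta}{25}\right)
\quad\text{whenever}\quad
\delta\geq 25\sqrt{6}\,\sqrt{\frac{B_{ab}\log n}{G_{a}G_{b}}}.
\end{aligned}
```

Under this assumed form, any constant $C\geq 25\sqrt{6}$ in the condition on $\delta$ is enough for this step.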

Proof of (6.83) and (6.84): The entry in the $i$-th row and $j$-th column of $\bm{K}$ follows the Bernoulli distribution with mean $\theta_{i}\theta_{j}B_{ab}$. Thus the sum of the $i$-th row of $\bm{K_{ab}}$ has a mean of $\theta_{i}G_{b}B_{ab}$. By Chernoff’s Inequality, we have

[TABLE]

Letting $t=2\log n+\sqrt{6\theta_{i}G_{b}B_{ab}\log n}$, we have with probability at least $1-\frac{1}{n^{3}}$,

[TABLE]

Note that $\delta>C\frac{\log n}{\theta_{\min}G_{\min}}\geq C\frac{\log n}{\theta_{i}G_{b}}$ and $\delta\geq C\sqrt{\frac{p^{+}\log n}{\theta_{\min}G_{\min}}}\geq C\sqrt{\frac{B_{ab}\log n}{\theta_{i}G_{b}}}$ hold for sufficiently large constant $C$ . It follows that $\sum\limits_{j\in C_{b}^{*}}K_{ij}\leq\theta_{i}G_{b}(B_{ab}+\frac{1}{20}\delta)$ , and therefore (6.83) holds. The bound (6.84) can be proved similarly.

Proof of (6.85): By Lemma 3, for $i\in C_{a}^{*}$ and $j\in C_{b}^{*}$ , we have $d_{i}\geq\left(1-\frac{\delta}{5p^{+}}\right)\theta_{i}H_{a}$ and $d_{j}\geq\left(1-\frac{\delta}{5p^{+}}\right)\theta_{j}H_{b}$ . Note that

[TABLE]

It follows that $\lambda\bm{d_{(a)}}\bm{d_{(b)}}^{T}\geq\lambda\left(1-\frac{\delta}{5p^{+}}\right)^{2}H_{a}H_{b}\bm{\theta_{(a)}}\bm{\theta_{(b)}}^{T}\geq\left(B_{ab}+\frac{6}{25}\delta\right)\bm{\theta_{(a)}}\bm{\theta_{(b)}}^{T}$, which finishes the proof. ∎

Lemma 5 (Chen et al., 2018, Lemma 5).

Let $\bm{A}=\left\{a_{ij}\right\}_{n\times n}$ be a symmetric random matrix. Moreover, suppose that $a_{ij}$ are independent zero-mean random variables satisfying $|a_{ij}|\leq 1$ and $\mathsf{var}(a_{ij})\leq\sigma^{2}$ . Then with probability at least $1-\frac{c}{n^{4}}$ , we have

[TABLE]

for some numerical constants $c$ and $C_{0}$.

Lemma 6.

With probability at least $1-c\frac{r}{l_{\min}^{4}}$, we have

[TABLE]

and

[TABLE]

Proof.

Note that the element $B_{aa}\theta_{i}^{2}-K_{ii}$ is a random variable with zero mean and variance $\theta_{i}^{2}B_{aa}\left(1-\theta_{i}^{2}B_{aa}\right)\leq\theta_{\max}^{2}B_{aa}$. Therefore, the matrix $B_{aa}\bm{\theta_{(a)}\theta_{(a)}^{T}}-\bm{K_{aa}}$ satisfies the condition of Lemma 5 with $\sigma=\sqrt{\theta_{\max}^{2}B_{aa}}$. Thus, with probability at least $1-\sum_{a=1}^{r}\frac{c}{l_{a}^{4}}$, we have

[TABLE]

for some numerical constants $c$ and $C_{0}$.

By a similar argument, we can prove that (6.88) holds. ∎

Appendix A Proof of Lemma 1

Consider the $j$-th row of the matrix $\alpha{\rm diag}\left(\bm{d}_{(r+1)}^{*}\right)-\bm{W}$. The sum of absolute values of the off-diagonal entries is at most $m-1$, whereas the absolute value of the corresponding diagonal entry is at least $\alpha d_{n+j}^{*}-1$. Notice that $\alpha d_{n+j}^{*}\geq\left(C\max\limits_{1\leq a\leq r}\frac{m}{H_{a}}\right)H^{+}\geq Cm$. Therefore, for a sufficiently large constant $C$ (actually we only require $C>1$), the diagonal entry is larger than the sum of absolute values of the off-diagonal entries. The Gershgorin Theorem (Horn and Johnson, 2012) implies that a symmetric matrix $\bm{A}=\{a_{ij}\}_{n\times n}$ is positive definite if $a_{ii}>\sum_{j\neq i}|a_{ij}|$ for all $i=1,2,\cdots,n$. Therefore, we obtain that the matrix $\alpha{\rm diag}\left(\bm{d}_{(r+1)}^{*}\right)-\bm{W}$ is a positive definite matrix. On the other hand, it is clear that $\lambda\bm{d}_{(r+1)}\bm{d}_{(r+1)}^{T}$ is a positive semidefinite matrix. We conclude that the matrix $\bm{\widetilde{W}}:=\alpha{\rm diag}\left(\bm{d^{*}_{(r+1)}}\right)+\lambda\bm{d_{(r+1)}}\bm{d_{(r+1)}^{T}}-\bm{W}$ is a positive definite matrix. This implies that the objective function of the optimization problem (6.5) is strongly convex. The feasible set of (6.5) is convex and compact, so the optimal solution exists and is unique.
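The diagonal-dominance argument above can be checked numerically on small instances. The sketch below is only an illustration under stated assumptions: the helper names are hypothetical, the inputs `d_star`, `d`, `alpha` and `lam` are arbitrary test values rather than the quantities defined in the paper, and positive definiteness is tested with a plain Cholesky factorization, which succeeds exactly when a symmetric matrix is positive definite.

```python
import random


def is_strictly_diagonally_dominant(M):
    """True if each diagonal entry exceeds the sum of the absolute values
    of the off-diagonal entries in its row."""
    n = len(M)
    return all(
        M[i][i] > sum(abs(M[i][j]) for j in range(n) if j != i)
        for i in range(n)
    )


def is_positive_definite(M):
    """Attempt a Cholesky factorization of the symmetric matrix M."""
    n = len(M)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = M[i][j] - sum(L[i][k] * L[j][k] for k in range(j))
            if i == j:
                if s <= 0.0:
                    return False  # a non-positive pivot rules out positive definiteness
                L[i][i] = s ** 0.5
            else:
                L[i][j] = s / L[j][j]
    return True


def build_outlier_matrix(W, d_star, d, alpha, lam):
    """Form alpha * diag(d_star) + lam * d d^T - W for the outlier block."""
    m = len(W)
    return [
        [(alpha * d_star[i] if i == j else 0.0) + lam * d[i] * d[j] - W[i][j]
         for j in range(m)]
        for i in range(m)
    ]
```

With `lam = 0` the matrix is the diagonally dominant part alone; adding the rank-one term `lam * d d^T` with `lam > 0` keeps it positive definite, mirroring the order of the argument in the text.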

It is easy to see that there exist feasible solutions to the optimization problem (6.5) with all inequalities satisfied strictly. Therefore, Slater’s condition holds, and the solution $\bm{x_{1}},\cdots,\bm{x_{r}}$ must satisfy the KKT conditions in (6.7), (6.8), and (6.9).

Since $\bm{\widetilde{W}x_{a}}+\bm{\widetilde{Z}_{a}^{T}1_{l_{a}}}=\bm{\beta_{a}-\Xi x_{a}}$ and $\langle\bm{x}_{a},\bm{\beta}_{a}\rangle=0$ , we have

[TABLE]

Because $\bm{\Xi}$ is a non-negative diagonal matrix, $\bm{\widetilde{W}+\Xi}$ is positive definite. By Cauchy-Schwarz Inequality, for all $1\leq a,b\leq r$ , we have

[TABLE]

Notice that equation (6.7) is equivalent to

[TABLE]

Taking the $j$-th row and using the non-negativity of $\bm{W}$ and $\bm{x_{a}}$, we have

[TABLE]

Finally, since $x_{a_{j}}\beta_{a_{j}}=0$ , if $\beta_{a_{j}}>0$ (thus $x_{a_{j}}=0$ ), we have

[TABLE]

or equivalently

[TABLE]

Bibliography

1. Adamic and Glance (2005). L. A. Adamic and N. Glance. The political blogosphere and the 2004 US election: divided they blog. In Proceedings of the 3rd International Workshop on Link Discovery, pages 36–43. ACM, 2005.
2. Airoldi et al. (2008). E. Airoldi, M. Blei, S. Fienberg, and E. Xing. Mixed membership stochastic blockmodels. J. Mach. Learn. Res., 9:1981–2014, 2008.
3. Ames and Vavasis (2011). B. P. W. Ames and S. A. Vavasis. Nuclear norm minimization for the planted clique and biclique problems. Mathematical Programming, 129(1):69–89, 2011.
4. Balcan and Gupta (2010). M.-F. Balcan and P. Gupta. Robust hierarchical clustering. In Conference on Learning Theory (COLT), 2010.
5. Bickel and Chen (2009). P. J. Bickel and A. Chen. A nonparametric view of network models and Newman-Girvan and other modularities. Proceedings of the National Academy of Sciences, 106(50):21068–21073, 2009.
6. Bollobás and Scott (2004). B. Bollobás and A. D. Scott. Max cut for random graphs with a planted partition. Combinatorics, Probability and Computing, 13(4-5):451–474, 2004.
7. Bordenave et al. (2018). C. Bordenave, M. Lelarge, and L. Massoulié. Nonbacktracking spectrum of random graphs: Community detection and nonregular Ramanujan graphs. Ann. Probab., 46(1):1–71, 2018.
8. Cai and Li (2015). T. Cai and X. Li. Robust and computationally feasible community detection in the presence of arbitrary outlier nodes. Ann. Statist., 43(3):1027–1059, 2015.
