Robust Clustering Oracle and Local Reconstructor of Cluster Structure of Graphs

Pan Peng

arXiv:1904.09710·cs.DS·April 23, 2019

TL;DR

This paper introduces sublinear time algorithms for analyzing and reconstructing the cluster structure of large, noisy graphs using conductance-based definitions, enabling efficient local clustering and property testing.

Contribution

It formalizes noisy clusterable graphs, develops a robust clustering oracle, and provides a local reconstructor, all operating in sublinear time with noisy data.

Findings

1. Developed a sublinear time algorithm for analyzing cluster structure.

2. Constructed a robust clustering oracle supporting typical cluster queries.

3. Designed a local reconstructor for noisy clusterable graphs.

Abstract

Due to the massive size of modern network data, local algorithms that run in sublinear time for analyzing the cluster structure of the graph are receiving growing interest. Two typical examples are local graph clustering algorithms that find a cluster from a seed node with running time proportional to the size of the output set, and clusterability testing algorithms that decide if a graph can be partitioned into a few clusters in the framework of property testing. In this work, we develop sublinear time algorithms for analyzing the cluster structure of graphs with noisy partial information. By using conductance based definitions for measuring the quality of clusters and the cluster structure, we formalize a definition of noisy clusterable graphs with bounded maximum degree. The algorithm is given query access to the adjacency list to such a graph. We then formalize the notion of…

Equations

$$\phi_{G}(S)\leq\phi_{\textrm{out}},\quad\phi(G[S])\geq\phi_{\textrm{in}}.$$

$$P_{i}:=\{u\in V:\textsc{WhichCluster}(u)=i\},\ 1\leq i\leq h,\quad B:=\{u\in V:\textsc{IsOutlier}(u)=\textrm{Yes}\}.$$

$$\|a_{s}^{t}-U_{C_{j}}\|_{TV}<\gamma+\varepsilon.$$

$$\mathrm{rcp}_{\theta}(u,v)=\sum_{w\in S_{u}^{\theta}\cap S_{v}^{\theta}}a_{u}^{t}(w)\,a_{v}^{t}(w).$$

$$\mathrm{rcp}_{0}(u,v)\leq\sum_{w\in V}a_{v}^{t}(w)\cdot a_{u}^{t}(w)=\sum_{w\in V}a_{v}^{t}(w)\cdot a_{w}^{t}(u)=b_{v}^{t}(u).$$

$$\mathrm{rcp}_{\theta}(u,v)-\delta\max\{\mathrm{rcp}_{\theta}(u,v),\tfrac{1}{2n}\}\leq\mathrm{rcp}^{\prime}(u,v)\leq\mathrm{rcp}_{0}(u,v)+\delta\max\{\mathrm{rcp}_{0}(u,v),\tfrac{1}{2n}\}.$$

$$\|a_{v}^{t}-U_{C_{i}}\|_{TV}\leq\varepsilon^{\prime}+\frac{\kappa}{2}\leq\kappa.$$

$$\sum_{i=1}^{h^{\prime}}|P_{\sigma(i)}\triangle D_{i}|+\sum_{j\in[h]\setminus\{\sigma(1),\dots,\sigma(h^{\prime})\}}|P_{j}|+|B|+|B^{\prime}|$$

$$|D_{i}\setminus D_{i}|+|B_{i}|\leq 4\varepsilon^{\prime}|D_{i}|+\varepsilon^{\prime}n\leq 5\varepsilon^{\prime}|D_{i}|\leq 5\varepsilon^{\prime}|C_{i}|\leq\frac{\kappa}{2}|C_{i}|.$$

$$\left(1-\frac{\kappa}{2}\right)\frac{|C_{i}|}{n}\cdot|S|$$

$$(1-\kappa)\frac{|C_{i}|}{n}\cdot|S|<\left(1-\frac{\kappa}{2}\right)\left(1-\frac{\kappa}{2}\right)\frac{|C_{i}|}{n}\cdot|S|$$

$$\|b_{u}^{t}-U_{A}\|_{TV}$$

$$\|p_{v}^{t}-U_{C_{i}}\|_{TV}\leq\xi.$$

$$\|v_{i}-\tilde{r}_{i}\|_{2}^{2}\leq c_{\ref{thm:PSZ_structure}}\cdot\frac{h\phi_{\textrm{out}}}{\phi_{\textrm{in}}^{2}}.$$

$$\sum_{v}Y_{v}=\sum_{v}\sum_{j=1}^{h}(v_{j}(v)-\tilde{r}_{j}(v))^{2}=\sum_{j=1}^{h}\|v_{j}-\tilde{r}_{j}\|_{2}^{2}\leq c_{\ref{thm:PSZ_structure}}\cdot\frac{h^{2}\phi_{\textrm{out}}}{\phi_{\textrm{in}}^{2}}$$

$$U_{C_{i}}=\frac{1_{C_{i}}}{|C_{i}|}=\left\langle 1_{v},\frac{1_{C_{i}}}{\sqrt{|C_{i}|}}\right\rangle\cdot\frac{1_{C_{i}}}{\sqrt{|C_{i}|}}=\langle 1_{v},r_{i}\rangle\cdot r_{i}=\sum_{j=1}^{h}r_{j}(v)\cdot r_{j}$$

$$\|p_{v}^{t}-U_{C_{i}}\|_{2}$$

$$\left\|\sum_{j=1}^{h}v_{j}(v)\cdot v_{j}-\sum_{j=1}^{h}\tilde{r}_{j}(v)\cdot\tilde{r}_{j}\right\|_{2}$$

$$\|p_{v}^{t}-U_{C_{i}}\|_{2}$$

$$\sum_{i\in U}\sigma^{(t)}(i)\leq\frac{1}{t}H(\sigma,\tau)+\frac{1}{t}\sum_{i\in U}\sum_{m=0}^{t-1}\tau^{m}(i)$$

$$P^{\prime}=\frac{I+\frac{1}{d}A_{D}}{2}+\frac{A_{1}}{2d}\left(\frac{I-\frac{1}{d}A_{B}}{2}\right)^{-1}\frac{A_{2}}{2d}=\frac{I+\frac{1}{d}\left(A_{D}+A_{1}(2dI-A_{B})^{-1}A_{2}\right)}{2}.$$

$$\sum_{u\in D_{i},v\in D\setminus D_{i}}\pi^{\prime}(u)p_{u}^{\prime}(v)=\frac{1}{|D|}\sum_{u\in D_{i},v\in D\setminus D_{i}}1_{u}\cdot P^{\prime}\cdot 1_{v}^{T}$$

$$\sum_{u\in D_{i},v\in D\setminus D_{i}}1_{u}\cdot A_{D}\cdot 1_{v}^{T}\leq|E_{G}(D_{i},D\setminus D_{i})|.$$

$$\frac{1}{2d}\sum_{u\in D_{i},v\in D\setminus D_{i}}1_{u}\cdot A_{1}A_{2}\cdot 1_{v}^{T}$$

$$\frac{1}{(2d)^{j+1}}\sum_{u\in D_{i},v\in D\setminus D_{i}}1_{u}\cdot A_{1}A_{B}^{j}A_{2}\cdot 1_{v}^{T}$$

$$\sum_{j=0}^{\infty}\frac{1}{(2d)^{j+1}}\sum_{u\in D_{i},v\in D\setminus D_{i}}1_{u}\cdot A_{1}A_{B}^{j}A_{2}\cdot 1_{v}^{T}\leq\sum_{j=0}^{\infty}\frac{1}{2^{j+1}}|E_{G}(C_{i},V\setminus C_{i})|=|E_{G}(C_{i},V\setminus C_{i})|$$

$$\sum_{u\in D_{i},v\in D\setminus D_{i}}\pi^{\prime}(u)p_{u}^{\prime}(v)$$

$$\mathrm{E}[\Gamma_{s}]\leq 2\ell.$$

$$\|\tau_{s}-U_{C_{j}}\|_{TV}$$

$$\sum_{i\in U}a_{s}^{t}(i)\leq\frac{1}{t}H(\sigma,\tau)+\frac{1}{t}\sum_{i\in U}\sum_{m=0}^{t-1}\tau^{m}(i),$$

$$\sum_{i\in U}(a_{s}^{t}(i)-U_{C_{j}}(i))\leq\frac{1}{t}H(\sigma,\tau)+\frac{1}{t}\sum_{i\in U}\sum_{m=0}^{t-1}(\tau^{m}(i)-U_{C_{j}}(i))$$

Taxonomy

Topics: Complexity and Algorithms in Graphs · Advanced Graph Theory Research · Machine Learning and Algorithms

Full text

Pan Peng, Department of Computer Science, University of Sheffield, Sheffield, U.K. Email: [email protected].

Due to the massive size of modern network data, local algorithms that run in sublinear time for analyzing the cluster structure of the graph are receiving growing interest. Two typical examples are local graph clustering algorithms that find a cluster from a seed node with running time proportional to the size of the output set, and clusterability testing algorithms that decide if a graph can be partitioned into a few clusters in the framework of property testing.

In this work, we develop sublinear time algorithms for analyzing the cluster structure of graphs with noisy partial information. By using conductance-based definitions for measuring the quality of clusters and the cluster structure, we formalize a definition of noisy clusterable graphs with bounded maximum degree. The algorithm is given query access to the adjacency list of such a graph. We then formalize the notion of a robust clustering oracle for a noisy clusterable graph, and give an algorithm that builds such an oracle in sublinear time, which can then be used to answer typical queries (e.g., IsOutlier( $s$ ), SameCluster( $s,t$ )) regarding the cluster structure of the graph in sublinear time. All the answers are consistent with a partition of $G$ in which all but a small fraction of vertices belong to some good cluster. We also give a local reconstructor for a noisy clusterable graph that provides, in sublinear time, query access to a reconstructed graph that is guaranteed to be clusterable. All the query answers are consistent with a clusterable graph which is guaranteed to be close to the original graph.

To obtain our results, we give a new analysis of the behavior of random walks on a noisy clusterable graph, which consists of a large subset that induces a clusterable graph and a small unknown subgraph (the noise). We show that a random walk of appropriately chosen length from a typical vertex in a large cluster of the clusterable part will mix well in the corresponding cluster. Using this, we are able to distinguish vertices in the clusterable part from those in the noisy part.

1 Introduction

Graph clustering is a fundamental task arising in many domains, including computer science, social science, network analysis and statistics. Given a graph, the task is to group the vertices into reasonably good clusters, where vertices inside the same cluster are well-connected to each other, and any two different clusters are well-separated. Such clusters convey valuable information about large graphs, and have concrete applications in recommendation systems, search engines, network routing and many others (see e.g., surveys [Sch07, POM09, For10, New12]). Many efficient global clustering algorithms that run in polynomial time have been proposed for analyzing the structure of graphs, where the goal is to find the overall cluster structure of a graph. Almost all such algorithms need to read at least the whole input graph and thus run in at least linear time. In fact, even just outputting all the clusters requires $\Omega(n)$ time, where $n$ is the number of vertices of the graph. These algorithms, though considered efficient in classical algorithm design, are becoming impractical (and sometimes even impossible) to use for processing and analyzing modern very large networks/graphs (e.g., WWW and social networks).

Therefore, local algorithms that run in sublinear time for analyzing the cluster structure of the graph are receiving growing interest. Such algorithms are typically assumed to be able to explore the input graph by performing appropriate queries, e.g., querying the degree or a neighbor of any node. There have been two main frameworks for designing sublinear algorithms for graph clustering, if one uses the well-motivated notion of conductance (see below) to measure the quality of clusters. In the first one, called local graph clustering, the goal is to find a cluster from a specified vertex with running time that is bounded in terms of the size of the output set (and with a weak dependence on $n$ ) (see e.g., [ST13, ACL06, AP09, OT12, AOPT16, ZLM13, OZ14]). If the target cluster is much smaller than the graph, then the running time of the resulting algorithm will be sublinear in the input size. In the second one, called testing cluster structure in the framework of property testing, the goal is to distinguish whether an input graph has a typical cluster structure or is far from having one (see [CPS15, CKK+18] and more discussions below). Such algorithms make decisions on the global cluster structure of the input graph by sampling vertices and locally exploring a small portion of the graph, and they can serve as a preliminary step before learning the cluster structure.

In this work, we study local and sublinear algorithms for analyzing the cluster structure of graphs that may contain noise and/or outliers. In many real applications, due to external noise or errors, the network data set may fail to have the desired property (here, the cluster structure), while it might still be close to having this property. That is, the graph $G$ under our consideration is some kind of perturbation of a clusterable graph, or a noisy clusterable graph: $G$ is first chosen from some class of clusterable graphs with an underlying but unknown partition, and then some noise and/or outliers are introduced by some adversary or in some random way. This is a relaxation of a common assumption for many existing clustering algorithms that the input graph is simply well clusterable. We would like to very efficiently process such a noisy clusterable graph and extract useful information regarding its cluster structure. Slightly more precisely, we study two types of sublinear algorithms for analyzing the cluster structure of graphs with noisy partial information.

The first type of algorithm is driven by the following natural question: Given a noisy clusterable graph, can we build an oracle (or implicit representation) in sublinear time, that can support typical queries regarding the cluster structure of the graph in sublinear time? For example, we would like to query “Is a vertex $s$ a noise/outlier?”. If the answer is “No”, we would further like to know “Which cluster does $s$ belong to?”, and “Do $s$ and $t$ belong to the same cluster?”, given that both vertices $s,t$ are not outliers. We require that all the query answers be consistent, e.g., if $u,v$ are reported to belong to the same cluster and $v,w$ are reported to belong to the same cluster, then $u,w$ will also be reported to belong to the same cluster. Furthermore, we would like to minimize the number of vertices for which the oracle returns the “wrong” answers, in the sense that the output partition of the algorithm should be close to an underlying maximal good clustering of the graph. We will call such an oracle a robust clustering oracle. Such oracles might already be interesting for real-world applications. For example, quickly identifying outliers might be valuable in road networks and medical data. Sometimes, we only want the cluster information of a small group of vertices and do not care about other parts of the graph. Furthermore, it is desirable to work on-the-fly on clean data after removing a small fraction of outliers. Besides these real-world applications, such oracles might be given as input to other clustering algorithms that are equipped with the power of making the above-mentioned clustering queries (see e.g., [MS17b, MS17a, AKBD16, ABJK18, ABJ18]).

Our second type of algorithm is motivated by a closely related question: Given a noisy clusterable graph, can we fix it by minimally modifying the original graph, and provide query access to the reconstructed clusterable graph in sublinear time? We address this question in the online reconstruction framework introduced by [ACCL08]. In this framework (for graphs), given a property $\Pi$ and query access to a graph $G$ that is close to having $\Pi$ , we want to output a graph $G^{\prime}$ such that $G^{\prime}$ has the property $\Pi$ and $G$ is modified minimally to get $G^{\prime}$ . Furthermore, we would like to output $G^{\prime}$ in a local and consistent way that provides query access to $G^{\prime}$ while making as few queries to the input graph $G$ as possible. The corresponding algorithm will be called a local reconstructor or local filter for property $\Pi$ [ACCL08, SS10, AT10]. The natural application of such local reconstructors is when only a small portion of the corrected graph $G^{\prime}$ is needed or when we want to make use of the graph $G^{\prime}$ in a distributed manner. (Note that in many applications, queries are made to a large graph which is assumed to exhibit some structural property.) Here, we focus on designing a local filter for the cluster structure of graphs and providing consistent query access to a clusterable graph. In practice, such algorithms might be used for quickly recommending products to users even if there is some noise in the data.

In this work, we give both a sublinear robust clustering oracle and a local reconstructor for the cluster structure of graphs. We now give basic definitions of clusters and (noisy) clusterable graphs, formalize our algorithmic problems, state our main results and sketch our technical ideas.

1.1 Basic Definitions

Conductance based clustering.

Following a recent line of research on graph clustering (e.g., [OT14, CPS15, PSZ17, DPRS19], which were built upon [KVV04]), we will use conductance-based definitions for measuring the quality of clusters and the cluster structure of graphs. In this paper, we will focus on undirected graphs with bounded maximum degree. We call an undirected graph $G=(V,E)$ a $d$ -bounded graph if its maximum degree is upper bounded by some parameter $d$ , which is always assumed to be some sufficiently large constant (at least $10$ ). For any two subsets $S,T\subseteq V$ , we let $E(S,T)$ denote the set of edges with one endpoint in $S$ and the other endpoint in $T$ . The conductance $\phi_{G}(S)$ of a set $S$ in $G$ is defined to be the ratio between the number of edges crossing between $S$ and its complement $V\setminus S$ and the maximum possible number of edges incident to $S$ , that is, $\phi_{G}(S):=\frac{|E(S,V\setminus S)|}{d|S|}.$ The conductance $\phi(G)$ of the graph $G$ is defined to be the minimum conductance over all sets $S$ of size at most $n/2$ , that is, $\phi(G):=\min_{S:|S|\leq n/2}\phi_{G}(S).$ For convenience, for the singleton graph $G$ (that consists of a single vertex with no edges) we define its inner conductance $\phi(G)$ to be $1$ .
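The two conductance definitions above can be made concrete with a small brute-force sketch. It is only an illustration on a toy graph: the function names and the example graph are assumptions, the toy uses $d=3$ rather than a large constant, and the minimum is taken over nonempty sets by exhaustive enumeration, which is exponential in $n$.

```python
from itertools import combinations

def outer_conductance(adj, S, d):
    """phi_G(S) = |E(S, V \\ S)| / (d * |S|) for a d-bounded graph given as adjacency lists."""
    S = set(S)
    crossing = sum(1 for u in S for v in adj[u] if v not in S)
    return crossing / (d * len(S))

def graph_conductance(adj, d):
    """phi(G): minimum of phi_G(S) over nonempty S with |S| <= n/2 (brute force)."""
    vertices = list(adj)
    n = len(vertices)
    if n == 1:
        return 1.0  # convention for the singleton graph
    return min(
        outer_conductance(adj, S, d)
        for size in range(1, n // 2 + 1)
        for S in combinations(vertices, size)
    )

# Toy graph (hypothetical): triangles {0,1,2} and {3,4,5} joined by the edge (2,3), with d = 3.
adj = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2, 4, 5], 4: [3, 5], 5: [3, 4]}
d = 3
phi_triangle = outer_conductance(adj, {0, 1, 2}, d)  # one crossing edge over d * |S| = 9
phi_graph = graph_conductance(adj, d)                # attained by one of the triangles
```

In this toy graph the sparsest cut separates the two triangles, so the graph conductance coincides with the outer conductance of a triangle.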

Given a vertex set $S\subset V$ , we let $G[S]$ denote the subgraph induced by the vertices in $S$ . In the following, we will refer to $\phi_{G}(S)$ and $\phi(G[S])$ as the outer conductance and inner conductance, respectively. Given two parameters $\phi_{\textrm{in}}$ and $\phi_{\textrm{out}}$ , we call a set $S$ a $(\phi_{\textrm{in}},\phi_{\textrm{out}})$ -cluster if

$$\phi_{G}(S)\leq\phi_{\textrm{out}},\quad\phi(G[S])\geq\phi_{\textrm{in}}.$$

For a good cluster $S$ , we expect $\phi_{\textrm{in}}$ to be large and $\phi_{\textrm{out}}$ to be small. In particular, if $S=V$ and $\phi(G[V])=\phi(G)\geq\phi_{\textrm{in}}\geq\phi$ for some constant $\phi$ , then we call the graph $G$ a $\phi$ -expander, which by itself is a good cluster and has been extensively studied in theoretical computer science (see e.g., [HLW06]). It is useful to note that $\phi_{G}(V)=0$ . When $G$ is clear from the context, we omit the subscript $G$ from $\phi_{G}(S)$ . A $k$ -partition of a graph $G=(V,E)$ is a partition of $V$ into $k$ subsets, $V_{1},\cdots,V_{k}$ such that $V_{i}\cap V_{j}=\emptyset$ for $i\neq j$ and $\cup_{i}V_{i}=V$ . We have the following definition of clusterable graphs, which characterizes graphs with typical cluster structure (see e.g., [OT14]).

Definition 1.1.

Given parameters $d,k,\phi_{\textrm{in}},\phi_{\textrm{out}}$ , we call a $k$ -partition $P_{1},\cdots,P_{k}$ of a $d$ -bounded graph $G$ a $(k,\phi_{\textrm{in}},\phi_{\textrm{out}})$ -clustering if for each $i\leq k$ , $\phi(G[P_{i}])\geq\phi_{\textrm{in}}$ and $\phi_{G}(P_{i})\leq\phi_{\textrm{out}}$ .

A $d$ -bounded graph $G$ is called $(k,\phi_{\textrm{in}},\phi_{\textrm{out}})$ -clusterable if $G$ has an $(h,\phi_{\textrm{in}},\phi_{\textrm{out}})$ -clustering for some $h\leq k$ .

Note that in our definition, a $(k,\phi_{\textrm{in}},\phi_{\textrm{out}})$ -clusterable graph may contain fewer than $k$ clusters, and $(1,\phi_{\textrm{in}},0)$ -clusterable graphs are equivalent to $\phi_{\textrm{in}}$ -expanders.
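As a sanity check on Definition 1.1, the following sketch tests whether a given partition is a $(k,\phi_{\textrm{in}},\phi_{\textrm{out}})$ -clustering of a tiny graph. The helper names and the example are assumptions made for illustration; the inner conductance of each part is computed by brute force on the induced subgraph, keeping the same degree bound $d$ in the denominator.

```python
from itertools import combinations

def phi_set(adj, S, d):
    """Outer conductance of S in the graph adj: crossing edges over d * |S|."""
    S = set(S)
    return sum(1 for u in S for v in adj[u] if v not in S) / (d * len(S))

def induced(adj, P):
    """Adjacency lists of the induced subgraph G[P]."""
    P = set(P)
    return {u: [v for v in adj[u] if v in P] for u in P}

def phi_graph(adj, d):
    """Conductance of a graph by brute force; the singleton graph has conductance 1."""
    vs = list(adj)
    if len(vs) == 1:
        return 1.0
    return min(phi_set(adj, S, d)
               for r in range(1, len(vs) // 2 + 1)
               for S in combinations(vs, r))

def is_clustering(adj, parts, d, phi_in, phi_out):
    """True if parts is a partition of V whose parts all satisfy both conductance bounds."""
    if any(len(P) == 0 for P in parts):
        return False
    covered = [v for P in parts for v in P]
    if sorted(covered) != sorted(adj):  # parts must be disjoint and cover V
        return False
    return all(phi_graph(induced(adj, P), d) >= phi_in and phi_set(adj, P, d) <= phi_out
               for P in parts)

# Toy graph (hypothetical): two triangles joined by one edge, d = 3.
adj = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2, 4, 5], 4: [3, 5], 5: [3, 4]}
two_parts = is_clustering(adj, [[0, 1, 2], [3, 4, 5]], 3, phi_in=0.5, phi_out=0.2)
one_part_strict = is_clustering(adj, [[0, 1, 2, 3, 4, 5]], 3, phi_in=0.5, phi_out=0.2)
one_part_loose = is_clustering(adj, [[0, 1, 2, 3, 4, 5]], 3, phi_in=0.1, phi_out=0.0)
```

The single-part case reflects the remark above: with one part the outer conductance is $0$ , so only the expansion of the whole graph matters.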

Clusterable graphs with modeling noise.

We assume that the input graph to the algorithm is generated from the family of all $(k,\phi_{\textrm{in}},\phi_{\textrm{out}})$ -clusterable graphs and then modified by an adversary in some manner. We have the following definition.

Definition 1.2.

(Clusterable Graphs with Modeling Noise or Noisy Clusterable Graphs) In this model, the adversary first chooses an arbitrary graph $G^{*}$ from the family of all $(k,\phi_{\textrm{in}},\phi_{\textrm{out}})$ -clusterable graphs with maximum degree upper bounded by $d$ . Then the adversary may do the following:

1. Choose an arbitrary $(h,\phi_{\textrm{in}},\phi_{\textrm{out}})$ -clustering $P_{1},\cdots,P_{h}$ of $G^{*}$ for some $h\leq k$ .

2. Insert and/or delete at most $\varepsilon\cdot dn$ edges (noise) within the clusters $G^{*}[P_{i}]$ , $1\leq i\leq h$ , while preserving the degree bound.

We call the resulting graph $G$ an $\varepsilon$ -perturbation of $G^{*}$ with respect to the $h$ -partition $P_{1},\cdots,P_{h}$ .

Equivalently, a graph $G$ is called an $\varepsilon$ -perturbation of a $(k,\phi_{\textrm{in}},\phi_{\textrm{out}})$ -clusterable graph if there is a partition of $G$ with at most $k$ parts (called clusters), such that one can insert/delete at most $\varepsilon dn$ intra-cluster edges to make it a $(k,\phi_{\textrm{in}},\phi_{\textrm{out}})$ -clusterable graph. For simplicity, in the above definition, we only allow the adversary to perturb the edges inside the clusters, while our algorithm can actually be extended to work for the case that the adversary is also allowed to perturb inter-cluster edges, up to a very limited extent. [Footnote 2: More precisely, the adversary can be allowed to perturb a $\phi_{\textrm{out}}$ fraction of inter-cluster edges: this can essentially be reduced to the case that only intra-cluster perturbations are allowed by rescaling the conductance values by a constant factor, i.e., one can view the adversary as first choosing a $(k,\phi_{\textrm{in}},2\phi_{\textrm{out}})$ -clusterable graph and then perturbing its intra-cluster edges.] This definition generalizes the notion of noisy expander graphs studied by Kale, Peres, and Seshadhri [KPS13], which corresponds to $k=1$ in our problem. In their setting, the adversary first chooses a $\phi$ -expander and then modifies it by inserting/deleting an $\varepsilon$ fraction of edges in the graph.
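The noise model of Definition 1.2 can be simulated with a short sketch. This is an assumption-laden illustration: the adversary in the definition is arbitrary, whereas the hypothetical `perturb_intra_cluster` below flips random intra-cluster pairs; it only demonstrates the three constraints (budget $\varepsilon dn$ , intra-cluster changes only, degree bound preserved).

```python
import random

def perturb_intra_cluster(edges, parts, d, eps, seed=0):
    """Flip at most eps*d*n intra-cluster vertex pairs while keeping every degree <= d.

    Returns (new_edge_set, changed_edge_set); edges are frozensets {u, v}.
    """
    rng = random.Random(seed)
    cluster_of = {v: i for i, P in enumerate(parts) for v in P}
    vertices = sorted(cluster_of)
    budget = int(eps * d * len(vertices))
    current = {frozenset(e) for e in edges}
    degree = {v: 0 for v in vertices}
    for e in current:
        for v in e:
            degree[v] += 1
    changed = set()
    for _ in range(budget):
        u, v = rng.sample(vertices, 2)
        if cluster_of[u] != cluster_of[v]:
            continue  # inter-cluster pairs are never touched in this model
        e = frozenset((u, v))
        if e in current:
            current.remove(e)
            degree[u] -= 1
            degree[v] -= 1
        elif degree[u] < d and degree[v] < d:
            current.add(e)
            degree[u] += 1
            degree[v] += 1
        else:
            continue  # insertion would violate the degree bound
        changed ^= {e}  # flipping the same pair twice restores the original edge
    return current, changed

# Toy instance (hypothetical): two 4-cycles joined by the edge (3, 4), d = 3, eps = 0.25.
parts = [[0, 1, 2, 3], [4, 5, 6, 7]]
edges = [(0, 1), (1, 2), (2, 3), (3, 0), (4, 5), (5, 6), (6, 7), (7, 4), (3, 4)]
new_edges, changed = perturb_intra_cluster(edges, parts, d=3, eps=0.25, seed=1)
```

Each loop iteration performs at most one flip, so the number of changed edges never exceeds the budget regardless of the random choices.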

1.2 Problem Formalizations and Main Results

Now we formalize our algorithmic problems and present our main results. For a $d$ -bounded graph $G$ , we will assume the algorithm is given query access to the adjacency list of $G$ , that is, in constant time we can query the $i$ -th neighbor of any vertex $v$ .
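The query model can be pictured as a thin wrapper around adjacency lists that counts queries. The class below is a hypothetical sketch; in particular, returning `None` when $v$ has fewer than $i$ neighbors is an assumed convention, not something specified above.

```python
class AdjacencyListOracle:
    """Query access to a d-bounded graph: neighbor(v, i) returns the i-th neighbor of v."""

    def __init__(self, adj, d):
        if any(len(nbrs) > d for nbrs in adj.values()):
            raise ValueError("graph is not d-bounded")
        self._adj = adj
        self.d = d
        self.queries = 0  # number of neighbor queries made so far

    def neighbor(self, v, i):
        """Return the i-th neighbor of v (1-indexed), or None if v has fewer than i neighbors."""
        self.queries += 1
        nbrs = self._adj[v]
        return nbrs[i - 1] if 1 <= i <= len(nbrs) else None

# Toy graph (hypothetical): two triangles joined by the edge (2, 3), d = 3.
graph = AdjacencyListOracle(
    {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2, 4, 5], 4: [3, 5], 5: [3, 4]}, d=3
)
third_of_2 = graph.neighbor(2, 3)   # vertex 2 has three neighbors
third_of_0 = graph.neighbor(0, 3)   # vertex 0 has only two neighbors
```

A sublinear algorithm in this model is one whose total number of such queries (and running time) is $o(n)$ .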

Robust clustering oracle.

Given query access to the adjacency list of a $d$ -bounded graph $G$ that is promised to be an $\varepsilon$ -perturbation of a $(k,\phi_{\textrm{in}},\phi_{\textrm{out}})$ -clusterable graph, we are interested in constructing an implicit representation, called a robust clustering oracle, of $G$ in sublinear time such that typical queries regarding the cluster structure of $G$ can be answered as quickly as possible (also in sublinear time). More precisely, the oracle should support the following types of clustering queries:

1) IsOutlier( $s$ ): Is a vertex $s$ a noise/outlier?

Intuitively, a vertex that does not belong to any good cluster should be reported as noise or outlier. For any non-outlier vertices $s,t$ , the oracle can further support

2) WhichCluster( $s$ ): Which cluster does $s$ belong to?

3) SameCluster( $s,t$ ): Do $s$ and $t$ belong to the same cluster?

In the following, without loss of generality, we will assume that for any non-outlier vertex $s$ and the corresponding WhichCluster( $s$ ) query, the oracle will output an integer $i$ with $1\leq i\leq h$ that specifies the index of the cluster that $s$ belongs to, for some integer $h$ . Furthermore, given the ability to answer WhichCluster queries, for any two non-outlier vertices $s,t$ , we simply define SameCluster( $s,t$ ) to be the procedure that checks if WhichCluster( $s$ ) is equal to WhichCluster( $t$ ). This naturally ensures consistency for SameCluster queries. Note that the output of the algorithm naturally defines a partition of $V$ , i.e.,

$$P_{i}:=\{u\in V:\textsc{WhichCluster}(u)=i\},\ 1\leq i\leq h,\quad B:=\{u\in V:\textsc{IsOutlier}(u)=\textrm{Yes}\}.$$

We would like to minimize the number of vertices for which the oracle returns the “wrong” answers. That is, for most vertices $v$ that do belong to some underlying good cluster in the perturbed $G$ , we expect IsOutlier( $v$ ) to return “No”. Furthermore, for most vertices $u,v$ that belong to the same cluster (resp. different clusters), we expect SameCluster( $u,v$ ) to return “Yes” (resp. “No”). One further crucial requirement of a robust clustering oracle and the corresponding clustering query algorithm is to maintain consistency among all queries. That is, on different query sequences, the answers of the oracle should be consistent with the same $h$ -partition $D_{1},\cdots,D_{h}$ of $V$ for some $h\leq k$ , in which all but a small fraction of vertices belong to some good cluster. Since the oracle construction and the corresponding query algorithm are typically randomized, we fix the random seed of the oracle and query algorithm once and for all to ensure consistent answers. The algorithm is then a deterministic procedure for any input query, which further guarantees that the partition $D_{1},\cdots,D_{h}$ is determined by $G$ and the internal randomness of the oracle and the algorithm, and is independent of the order of queries. This feature allows the oracle to be used in a distributed manner, as consistency is guaranteed.
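The consistency requirement can be illustrated with a minimal sketch of the query interface. Everything here is hypothetical: `ConsistentOracle` wraps an arbitrary randomized per-vertex classifier, and `noisy_classify` is a stand-in that has nothing to do with the actual random-walk-based procedure. The only point is that once the seed is fixed, every answer is a function of the seed and the queried vertex, so SameCluster (defined through WhichCluster) is automatically consistent and independent of query order.

```python
import hashlib
import random

class ConsistentOracle:
    """Answers depend only on (seed, vertex), never on the history of queries."""

    def __init__(self, classify, seed):
        self._classify = classify  # hypothetical: classify(v, rng) -> cluster index, or None for outlier
        self._seed = seed

    def _label(self, v):
        # Derive the randomness used for v from the fixed seed and v itself.
        digest = hashlib.sha256(f"{self._seed}:{v}".encode()).digest()
        rng = random.Random(int.from_bytes(digest[:8], "big"))
        return self._classify(v, rng)

    def is_outlier(self, v):
        return self._label(v) is None

    def which_cluster(self, v):
        label = self._label(v)
        if label is None:
            raise ValueError("which_cluster is only defined for non-outliers")
        return label

    def same_cluster(self, s, t):
        return self.which_cluster(s) == self.which_cluster(t)

    def partition(self, vertices):
        """The induced partition: parts[i] = P_i and outliers = B."""
        parts, outliers = {}, set()
        for v in vertices:
            label = self._label(v)
            if label is None:
                outliers.add(v)
            else:
                parts.setdefault(label, set()).add(v)
        return parts, outliers

def noisy_classify(v, rng):
    """Stand-in classifier (assumption): blocks of 5 vertices, about 10% flagged as outliers."""
    return None if rng.random() < 0.1 else 1 + v // 5

vertices = list(range(20))
oracle = ConsistentOracle(noisy_classify, seed=7)
parts, outliers = oracle.partition(vertices)
```

Two oracles built with the same seed return identical answers no matter in which order the vertices are queried, which is what makes distributed use possible.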

We provide the first robust clustering oracle with both sublinear preprocessing time and query time. For simplicity, we will assume both $d,k$ are constant throughout the paper. Let $P\triangle Q$ denote the symmetric difference between two vertex sets $P,Q$ .

Theorem 1.3 (Robust Clustering Oracle).

There exists an algorithm that takes as input parameters $n\geq 1$ , $d>10$ , $k\geq 1$ , $\phi\in(0,1)$ , $\varepsilon\in[\Omega(\frac{\phi}{{n}}),1]$ and has query access to the adjacency list of a graph $G=(V,E)$ that is an $\varepsilon$ -perturbation of a $(k,\phi,O(\frac{\varepsilon\phi}{k^{3}\log n}))$ -clusterable graph, and constructs a robust clustering oracle in $O(\sqrt{n}\cdot\textrm{poly}(\frac{k\cdot\log n}{\phi\varepsilon}))$ pre-processing time. Furthermore, it holds that

1. Using the oracle, the algorithm can answer any clustering query (i.e., IsOutlier, WhichCluster or SameCluster) in $O(\sqrt{n}\cdot\textrm{poly}(\frac{k\cdot\log n}{\phi\varepsilon}))$ time.

2. There exists a partition $D_{1},\cdots,D_{h^{\prime}},B^{\prime}$ of $G$ , for some $h^{\prime}\leq k$ , such that

• the partition only depends on $G$ and the input parameters of the algorithm, and is independent of the order of queries;

• if $\varepsilon\in[\Omega(\frac{\phi}{{n}}),\frac{\phi}{60k^{2}}]$ , then $h^{\prime}\geq 1$ and each $D_{i}$ is a $(\frac{\phi}{2},\frac{a_{\ref{thm:rw_perturbed}}\sqrt{\varepsilon}\kappa^{4}\phi^{1.5}}{3k^{3}\log n})$ -cluster, for any $1\leq i\leq h^{\prime}$ ; if $\varepsilon\in(\frac{\phi}{60k^{2}},1]$ , then $h^{\prime}=0$ ; and

• with probability at least $1-\frac{1}{n}$ , the partition $P_{1},\cdots,P_{h},B$ output by the algorithm satisfies that $h^{\prime}\leq h\leq k$ and $\sum_{i=1}^{h^{\prime}}|P_{i}\triangle D_{i}|+|(\cup_{i=h^{\prime}+1}^{h}P_{i})\cup B|+|B^{\prime}|=O(k\sqrt{\frac{\varepsilon}{\phi}}n)$ .
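The error measure in the last item can be evaluated directly once both partitions are known. The sketch below is illustrative only: the function name and the example sets are assumptions, and the output clusters are assumed to be already indexed so that $P_{i}$ is matched with $D_{i}$ for $i\leq h^{\prime}$ , as in the statement.

```python
def partition_error(P, B, D, B_prime):
    """sum_{i<=h'} |P_i symdiff D_i| + |(P_{h'+1} u ... u P_h) u B| + |B'|.

    P, D are lists of sets aligned by index, with len(D) = h' <= h = len(P).
    """
    h_prime = len(D)
    matched = sum(len(P[i] ^ D[i]) for i in range(h_prime))  # symmetric differences
    leftover = set(B).union(*P[h_prime:])                    # unmatched output clusters and outliers
    return matched + len(leftover) + len(B_prime)

# Hypothetical example with h' = 2 reference clusters and h = 3 output clusters.
D = [{0, 1, 2, 3}, {4, 5, 6}]
B_prime = {7}
P = [{0, 1, 2}, {4, 5, 6, 7}, {3}]
B = set()
error = partition_error(P, B, D, B_prime)
```

Here vertex 3 is counted twice (once in $P_{1}\triangle D_{1}$ and once in the unmatched cluster $P_{3}$ ) and vertex 7 is counted twice as well (in $P_{2}\triangle D_{2}$ and in $B^{\prime}$ ), which is how the sum in the theorem treats such vertices.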

We remark that there is no algorithm that allows both $o(\sqrt{n})$ pre-processing time and $o(\sqrt{n})$ query time for IsOutlier queries, as otherwise, one could obtain a property testing algorithm for expansion with $o(\sqrt{n})$ queries, which would contradict a known lower bound [GR00] (see more discussions below on the relation to property testing). Furthermore, the second item of the theorem implies that the total number of vertices that are reported as outliers is at most $O(k\sqrt{\frac{\varepsilon}{\phi}}n)$ and that the query answers are consistent with a partition of $G$ in which all but $O_{k}(\sqrt{\frac{\varepsilon}{\phi}}n)$ vertices belong to a $(\frac{\phi}{2},O_{k}(\frac{\sqrt{\varepsilon}\phi^{1.5}}{\log n}))$ -cluster. We also note that in the statement of the above theorem, the most interesting range of $\varepsilon$ is $\varepsilon\in[\Omega(\frac{\phi}{{n}}),\frac{\phi}{60k^{2}}]$ , as otherwise (i.e., $\varepsilon>\frac{\phi}{60k^{2}}$ ) the noise will be too much and our algorithm cannot guarantee to locally identify even one cluster. [Footnote 3: Note that in this range, $\varepsilon=O(\frac{\phi}{k^{2}})$ , which is also the reason that we do see the traditional $\phi^{2}$ dependency (from Cheeger’s inequality) between the outer conductance and inner conductance.] Removing the $\log n$ gap between the inner conductance and outer conductance seems to be hard, at least for methods that are based on random walk distances (as we used here). For example, in [CKK+18], it has been discussed that in general, it is impossible to use Euclidean distance between random walk distributions to test $2$ -clusterability if one wants the gap to be a constant. (Testing $2$ -clusterability is an easier problem than the robust clustering oracle problem; see below.) On the other hand, being able to correctly answer SameCluster( $u,v$ ) queries intuitively requires or induces a distance-based approach, as the vertices in the same cluster are “similar” or “close to” each other, while vertices in different clusters are “dissimilar” or “far from” each other.
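The intuition behind distance-based approaches can be seen on a toy example: random walk distributions started inside the same cluster become close quickly, while those started in different clusters stay far apart. The sketch below computes exact distributions of a lazy walk and compares them in Euclidean distance. The specific walk (move to each neighbor with probability $\frac{1}{2d}$ , otherwise stay), the graph, and the walk length are assumptions chosen for illustration, not the parameters of the algorithm.

```python
def lazy_walk_distribution(adj, d, start, t):
    """Exact distribution after t steps of a lazy walk on a d-bounded graph.

    Assumed walk: from u, move to each neighbor with probability 1/(2d), otherwise stay at u.
    """
    p = {v: 0.0 for v in adj}
    p[start] = 1.0
    for _ in range(t):
        q = {v: 0.0 for v in adj}
        for u, mass in p.items():
            if mass == 0.0:
                continue
            step = mass / (2 * d)
            for w in adj[u]:
                q[w] += step
            q[u] += mass - step * len(adj[u])  # remaining probability stays put
        p = q
    return p

def l2_distance(p, q):
    """Euclidean distance between two distributions over the same vertex set."""
    return sum((p[v] - q[v]) ** 2 for v in p) ** 0.5

# Toy graph (hypothetical): 4-cliques {0,1,2,3} and {4,5,6,7} joined by the edge (3, 4), d = 4.
adj = {v: [u for u in range(4) if u != v] for v in range(4)}
adj.update({v: [u for u in range(4, 8) if u != v] for v in range(4, 8)})
adj[3].append(4)
adj[4].append(3)

t = 8
walk = {v: lazy_walk_distribution(adj, 4, v, t) for v in (0, 1, 7)}
dist_same = l2_distance(walk[0], walk[1])       # both starts in the first clique
dist_different = l2_distance(walk[0], walk[7])  # starts in different cliques
```

On this graph the same-cluster distance decays geometrically with the walk length, whereas only a small amount of probability mass crosses the single bridging edge in eight steps, so the cross-cluster distance remains large.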

Local reconstructor of graph cluster structure.

We are interested in designing a local reconstruction algorithm for the cluster structure of graphs. Given query access to the adjacency list of a $d$ -bounded graph $G$ that is promised to be an $\varepsilon$ -perturbation of a $(k,\phi_{\textrm{in}},\phi_{\textrm{out}})$ -clusterable graph, our goal is to design a local filter that provides query access to a $(k,\phi_{\textrm{in}}^{\prime},\phi_{\textrm{out}}^{\prime})$ -clusterable graph $G^{\prime}$ such that the distance between $G$ and $G^{\prime}$ is as small as possible. That is, we would like to output $G^{\prime}$ in a local manner such that for any queried vertex $v$ , the neighborhood of $v$ , i.e., the set of all neighbors of $v$ , in $G^{\prime}$ can be answered in sublinear time (in particular, by making as few queries to the adjacency list of $G$ as possible). As with the robust clustering oracle, it is crucial to require a local filter to maintain consistency among all queries. Here we require that for different query sequences, the answers of the filter should be consistent with the same reconstructed graph $G^{\prime}$ . Again, the filter is suitable for use in a distributed manner, as consistency is guaranteed. In our local filter for clusterable graphs, we also aim to make the gap between $\phi_{\textrm{in}},\phi_{\textrm{out}}$ and the gap between $\phi_{\textrm{in}}$ and $\phi_{\textrm{in}}^{\prime}$ as small as possible. We next state our theorem regarding our local filter for clusterable graphs.

Theorem 1.4 (Local Reconstructor of Cluster Structure).

There exists a local reconstruction algorithm that takes as input parameters $n\geq 1$ , $d>10$ , $k\geq 1$ , $\phi\in(0,1)$ , $\varepsilon\in[\Omega(\frac{\phi}{{n}}),1]$ and has query access to the adjacency list of a graph $G=(V,E)$ that is an $\varepsilon$ -perturbation of a $(k,\phi,O(\frac{\varepsilon\phi}{k^{3}\log n}))$ -clusterable graph, and provides query access to a graph $G^{\prime}=(V,E^{\prime})$ such that the following holds with probability at least $1-\frac{4}{n}$ :

1. $G^{\prime}$ is $(k,\Omega(\frac{\varepsilon\phi}{k^{4}\log n}),1)$ -clusterable, and has maximum degree at most $d+16$ .

2. The number of edges changed is at most $O(\min\{1,k\sqrt{\frac{\varepsilon}{\phi}}\}\cdot n)$ .

3. $G^{\prime}$ is determined by $G$ and the internal randomness of the algorithm, and is independent of the order of queries.

4. On each query $v$ , the neighborhood of $v$ in $G^{\prime}$ can be answered in $O(\sqrt{n}\cdot\textrm{poly}(\frac{k\cdot\log n}{\phi\varepsilon}))$ time.

Note that by Item 1, the resulting graph can be partitioned into at most $k$ parts, each with relatively large inner conductance (i.e., $\Omega_{k}(\frac{\varepsilon\phi}{\log n})$ ), with no guarantee on outer conductance (as each set trivially has outer conductance at most $1$ ). (Such instances are exactly the object that was studied in [CKK+18] in the framework of property testing.) By sacrificing the inner conductance quality, we can also find a clustering of $G^{\prime}$ with small outer conductance. That is, we can guarantee that $G^{\prime}$ is also $(k,\Omega(\frac{\nu^{k}}{6^{k}k^{4}}\frac{\varepsilon\phi}{\log n}),\min\{k\nu,1\})$ -clusterable for any $\nu\in[0,1]$ (see Appendix C for details). Item 3 implies that all query answers are consistent, that is, the vertex $u$ is output as a neighbor of $v$ in $G^{\prime}$ if and only if $v$ is output as a neighbor of $u$ . From the discussion below on the connections between our local reconstruction algorithm and property testing, the running time of our filter is optimal (in terms of dependency on $n$ ) up to polylogarithmic factors.

Furthermore, our algorithm generalizes the local reconstruction algorithm for expander graphs by [KPS13], which corresponds to the special case $k=1$ in our problem, though our approximation ratio of the number of modified edges is worse. More precisely, for $\varepsilon=\Omega(\phi)$ , both our algorithm and the algorithm in [KPS13] will add $\Theta(dn)$ edges (as the noise part is too large, and thus almost all vertices will be reported as outliers and the resulting graph is almost the complete hybrid of the original graph and an explicitly constructible expander (see Section 1.3 for more discussions)); for $\varepsilon=O(\phi)$ , the algorithm in [KPS13] reconstructs a graph that is an $\varepsilon$ -perturbation of a $\phi$ -expander by modifying at most $O(\frac{\varepsilon}{\phi}n)$ edges, and the resulting graph has conductance at least $\Omega(\frac{\phi^{2}}{\log n})$ and maximum degree also upper bounded by $d+16$ (footnote 4), while our algorithm has to modify $O(\sqrt{\frac{\varepsilon}{\phi}}\cdot kn)$ edges. (Footnote 4: Note that [KPS13] claimed that the number of modified edges is at most $O(\frac{\phi}{\log n}\varepsilon n)$ and the maximum degree of the resulting graph is $d+O(\lceil\frac{d\phi^{2}}{\log n}\rceil)$ . However, this claim is not correct (at least for $d$ -bounded graphs with $d$ being constant), and the number of changed edges and the maximum degree bound from their analysis should be $O(\frac{\varepsilon}{\phi}n)$ and $d+16$ , respectively [Ses19]. They obtained their claimed results by adding $t:=\lceil\frac{d\phi^{2}}{c\log n}\rceil$ parallel edges while repairing bad vertices, from which they get that the maximum degree is $d+16t$ and the number of added edges to the optimal distance (i.e., $\varepsilon dn$ ) is $\frac{16t}{d\phi}=O(\phi/\log n)$ , which is incorrect as it always holds that $t=1$ for constant $d$ and large enough $n$ .) 
We further note that the algorithm in [KPS13] guarantees that the reconstructed graph has inner conductance at least $\Omega(\frac{\phi^{2}}{\log n})$ , while the resulting graph from our algorithm is guaranteed to have a partition with at most $k$ parts, each with inner conductance at least $\Omega_{k}(\frac{\varepsilon\phi}{\log n})$ . Removing the $\log n$ factor in the inner conductance of the output graph seems to be a very challenging task, even for the case $k=1$ . See Section 6 for more discussions.

Local mixing property on noisy clusterable graphs.

In order to derive the above algorithmic results, we prove an interesting behavior of random walks on noisy clusterable graphs, which we call the local mixing property. For technical reasons, we will consider the uniform averaging walk of $t$ steps on a graph $G$ : In this walk, we choose a number $\ell\in\{0,1,2,\cdots,t-1\}$ uniformly at random, and stop the (normal) random walk after $\ell$ steps. We let ${\textbf{a}}_{v}^{t}$ denote the probability vector for a uniform averaging walk of $t$ steps starting at $v$ and let $\lVert\textbf{p}_{1}-\textbf{p}_{2}\rVert_{\textrm{TV}}$ denote the total variation distance between two distributions $\textbf{p}_{1},\textbf{p}_{2}$ . We have the following theorem.

Theorem 1.5 (Local Mixing Property of Random Walks).

Let $0<\gamma,\varepsilon<1$ . Let $\phi_{\textrm{out}}\leq\frac{a_{1.5}\varepsilon\gamma^{4}\phi_{\textrm{in}}^{2}}{k^{3}\log n}$ for some sufficiently small constant $a_{1.5}>0$ . Let $G$ be a $d$ -bounded graph with an $h$ -partition $C_{1},C_{2},\cdots,C_{h}$ such that $\phi_{G}(C_{i})\leq\phi_{\textrm{out}}$ for any $1\leq i\leq h\leq k$ . For each $i\leq h$ , we let $D_{i}\subseteq C_{i}$ denote a large subset of vertices such that $\phi(G[D_{i}])\geq\phi_{\textrm{in}}$ , and let $B_{i}:=C_{i}\setminus D_{i}$ . If $\sum_{i}|B_{i}|\leq\varepsilon n$ , then for any $D_{j}$ with $|D_{j}|\geq 3\sqrt{\varepsilon}n$ , there exists a subset $\widehat{D}_{j}\subseteq D_{j}$ with $|\widehat{D}_{j}|\geq(1-4\sqrt{\varepsilon})|D_{j}|$ such that for any $s\in\widehat{D}_{j}$ and $t=\frac{120\log n}{\gamma\phi_{\textrm{in}}^{2}}$ , it holds that

[TABLE]

Intuitively, the set $B_{i}$ corresponds to the noisy part inside each cluster $C_{i}$ and we assume that the total fraction of noisy part is parametrized by $\varepsilon$ . Then the above theorem says that the rest of the large part (i.e., clusterable part) exhibits some nice local mixing property: a typical uniform averaging random walk (of appropriately chosen length) from a large cluster (of size $\Omega(\sqrt{\varepsilon}n)$ ) will converge quickly to the uniform distribution on it. This is a generalization of the global mixing property of noisy expander graphs in [KPS13], though their results are stated for the more general Markov chains.
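To make the intuition above concrete, the following is a minimal numerical sketch and not part of the paper's analysis. It assumes a toy instance (two $20$ -cliques joined by a single edge, so $d=20$ ) and hypothetical helper names (`two_cliques`, `averaging_walk`, `tv`). It computes the uniform averaging walk distribution exactly and compares its total variation distance to the uniform distribution on the starting clique with its distance to the uniform distribution on the whole vertex set.

```python
import numpy as np

def two_cliques(m):
    """Toy graph (an assumption for illustration): two m-cliques joined
    by the single edge (m-1, m). The maximum degree is m."""
    adj = {v: [] for v in range(2 * m)}
    for base in (0, m):
        for u in range(base, base + m):
            adj[u] = [w for w in range(base, base + m) if w != u]
    adj[m - 1].append(m)
    adj[m].append(m - 1)
    return adj

def averaging_walk(adj, d, s, t):
    """a_s^t: average of the walk distributions p_s^0, ..., p_s^{t-1}."""
    n = len(adj)
    P = np.zeros((n, n))
    for v, nbrs in adj.items():
        for u in nbrs:
            P[v, u] = 1.0 / (2 * d)            # move to each neighbor
        P[v, v] = 1.0 - len(nbrs) / (2 * d)    # stay put otherwise
    p = np.zeros(n)
    p[s] = 1.0
    a = np.zeros(n)
    for _ in range(t):
        a += p
        p = p @ P
    return a / t

def tv(p, q):
    """Total variation distance, i.e. half of the l1 distance."""
    return 0.5 * np.abs(p - q).sum()

m = 20
adj = two_cliques(m)
a = averaging_walk(adj, d=m, s=0, t=40)

u_cluster = np.zeros(2 * m)
u_cluster[:m] = 1.0 / m                        # uniform on the start clique
u_global = np.full(2 * m, 1.0 / (2 * m))       # uniform on all vertices

tv_cluster = tv(a, u_cluster)
tv_global = tv(a, u_global)
print(tv_cluster, tv_global)
```

On this toy instance the walk is close to uniform on its own clique while remaining far from the global uniform distribution, which is the qualitative behavior that the theorem describes for large clusters.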

1.3 Our Techniques

To design a robust clustering oracle, we first note that it is relatively easy to design a clustering oracle without noise (if the gap between $\phi_{\textrm{in}}$ and $\phi_{\textrm{out}}$ is $O(\log n)$ as we considered here). This can be done by a refined analysis of the property testing algorithm in [CPS15], which samples a small number of vertices and then tests whether the $\ell_{2}$ norm distance between the random walk distributions from any two vertices is larger than some threshold. However, the analysis depends on the spectral property (e.g., a gap between $\lambda_{k}$ and $\lambda_{k+1}$ ) of clusterable graphs, and cannot be easily generalized to the case where the input graph contains noise, as such a spectral property is very sensitive to noise (e.g., deleting all edges incident to a constant number of vertices destroys the property).

In order to handle noisy input, we use the $\ell_{1}$ norm distance between the corresponding random walk distributions to test whether the two starting vertices belong to the same cluster, and we make use of the local mixing property of random walks in Theorem 1.5. In order to prove such a mixing property, we first show that it does hold for clusterable graphs without noise, by exploiting a spectral property that characterizes the first $k$ eigenvectors of clusterable graphs given by [PSZ17]. To generalize the result to a noisy clusterable graph $G$ , we view the random walks on the graph as a Markov chain and consider a new Markov chain that is induced on vertices in the clusterable part in $G$ . (Such a new chain has also been used in [KPS13] for analyzing noisy expanders.) We show the induced Markov chain does correspond to a clusterable graph $H$ (by overcoming the difficulty that the outer conductance of each corresponding cluster increases and might change the cluster structure too much) and thus the random walks in $H$ satisfy the local mixing property. However, the walks on $H$ can be very different from the random walks in the original graph $G$ . We then give a novel application of an old technique called stopping rules of Markov chains that was introduced by Lovász and Winkler [LW97] to relate these two walks, and bound the total variation distance between two random walk distributions from a vertex in any large cluster of $G$ and $H$ . This allows us to show the local mixing property in the graph $G$ . To the best of our knowledge, we are the first to use the tool of stopping rules to show that a random walk in the graph mixes inside a subgraph (i.e., cluster) rather than in the whole graph.

Given such a local mixing property of random walks in the noisy clusterable graph, we are able to design a robust clustering oracle and the corresponding clustering query algorithm with sublinear preprocessing and query time. We first note that if the noisy part is not too large (i.e., $\varepsilon=O(\phi/k^{2})$ ), then the graph $G$ has a non-trivial partition $D_{1},\cdots,D_{h^{\prime}},B^{\prime}$ with $h^{\prime}\geq 1$ that only depends on the corresponding parameters (i.e., $\varepsilon,\phi,n$ ) and $G$ itself, and that each $D_{i}$ is a good cluster with large size (containing at least $\Omega(\sqrt{\varepsilon})$ fraction of vertices), and $B^{\prime}$ has small size. Our key idea is to use random walks to learn a succinct representation $H$ , which is a weighted graph with roughly $O(\log n)$ vertices, of the clusterable part of graph $G$ , such that each cluster $D_{i}$ in $G$ will be mapped to a unique clique (called a core) in $H$ with appropriate edge weight. Furthermore, by using the weights and the size bounds of these cliques, we can efficiently identify them from $H$ , which in turn allows us to answer the WhichCluster queries. Slightly more precisely, in the preprocessing (or learning) phase, the algorithm samples a set $S$ of $\Theta(\log n)$ vertices, and uses the statistics of $\tilde{O}(\sqrt{n})$ random walks from each sampled vertex to (quite accurately) estimate the so-called reduced collision probability (rcp), introduced in [KPS13], of (the random walks of appropriate length from) any two sampled vertices. We construct a weighted similarity graph $H$ on the sample set $S$ such that the weight of each edge $(u,v)$ is our estimate of the rcp of $u,v$ , for any $u,v\in S$ . We show that if the noisy part is not too large, then, by the aforementioned local mixing property, for (most) pairs of vertices $u,v\in S\cap D_{i}$ , the rcp of $u,v$ will be close to $1/|D_{i}|$ . 
Thus, the weight of edge $(u,v)$ in $H$ will be set to be a number close to $1/|D_{i}|$ , and most vertices in $S\cap D_{i}$ form a clique $S_{i}$ in $H$ with edge weights close to $1/|D_{i}|$ . We further observe that $S_{i}$ has relatively large size (roughly $|S|\cdot\frac{|D_{i}|}{n}$ ), as $|D_{i}|$ is large; and that any vertex $v\in S_{i}$ can only belong to exactly one such (large) clique, as otherwise, the total probability mass of random walk distribution from $v$ will exceed $1$ , which can not happen. These properties allow us to efficiently identify the unique core $S_{i}$ from $H$ that corresponds to the cluster $D_{i}$ by a simple greedy algorithm and further to answer membership queries. We remark that in [CPS15], a similarity graph is also constructed, while that graph is unweighted and only tells if the original graph is $k$ -clusterable or not according to the number of connected components, which is far from sufficient for our application.

Then in the query phase, we check if the queried vertex $v$ belongs to any of the learned cores or not to decide if it is an outlier or not. This, again, can be done by estimating the rcp of the walks from $v$ and other vertices in $S$ (by running $\tilde{O}(\sqrt{n})$ random walks), and is guaranteed by the local mixing property of random walks. In particular, for most vertices $v$ in a cluster $D_{i}$ , the rcp of random walks from $v$ and any other vertex that is in $S_{i}$ corresponding to $D_{i}$ will be also around $1/|D_{i}|$ . If this is the case, we output $i$ as the index of the cluster that $v$ belongs to; otherwise, we report it as an outlier. The above analysis shows that most vertices in $D_{1},\cdots,D_{h^{\prime}}$ will be correctly classified, or equivalently, the number of vertices that are reported as outliers is small.
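The learning and query phases sketched above can be illustrated as follows. This is a simplified toy, not the paper's algorithm: it assumes exact walk distributions (instead of $\tilde{O}(\sqrt{n})$ sampled walks), uses the plain collision probability $\sum_{w}{\textbf{a}}_{u}^{t}(w){\textbf{a}}_{v}^{t}(w)$ as a stand-in for the reduced collision probability, and replaces FindCore by a hypothetical threshold-based greedy grouping. All function names, the toy graph, and the threshold $\frac{1}{2m}$ are assumptions made for illustration.

```python
import numpy as np

def two_cliques_plus_isolated(m):
    """Toy input: two m-cliques joined by edge (m-1, m), plus an
    isolated vertex 2m that plays the role of noise."""
    adj = {v: [] for v in range(2 * m + 1)}
    for base in (0, m):
        for u in range(base, base + m):
            adj[u] = [w for w in range(base, base + m) if w != u]
    adj[m - 1].append(m)
    adj[m].append(m - 1)
    return adj

def averaging_rows(adj, d, t):
    """Matrix whose row v is the averaging-walk distribution a_v^t."""
    n = len(adj)
    P = np.zeros((n, n))
    for v, nbrs in adj.items():
        for u in nbrs:
            P[v, u] = 1.0 / (2 * d)
        P[v, v] = 1.0 - len(nbrs) / (2 * d)
    total, power = np.zeros((n, n)), np.eye(n)
    for _ in range(t):
        total += power
        power = power @ P
    return total / t

def weight(A, u, v):
    """Plain collision probability of the walks from u and v
    (a simplification of the reduced collision probability)."""
    return float(A[u] @ A[v])

def learn_cores(A, sample, threshold):
    """Greedy grouping: a sampled vertex joins the first group in which
    it has weight >= threshold to every member; else it opens a group."""
    cores = []
    for v in sample:
        for core in cores:
            if all(weight(A, v, u) >= threshold for u in core):
                core.append(v)
                break
        else:
            cores.append([v])
    return cores

def which_cluster(A, cores, v, threshold):
    """Index of the core that v is similar to, or None for an outlier."""
    for i, core in enumerate(cores):
        if all(weight(A, v, u) >= threshold for u in core):
            return i
    return None

m = 20
adj = two_cliques_plus_isolated(m)
A = averaging_rows(adj, d=m, t=40)
threshold = 0.5 / m          # half of the ideal same-cluster weight 1/m
cores = learn_cores(A, sample=[0, 25, 3, 30, 7], threshold=threshold)
print(cores, which_cluster(A, cores, 10, threshold),
      which_cluster(A, cores, 2 * m, threshold))
```

In this toy, same-clique pairs have weight close to $1/m$ and cross-clique pairs have weight close to $0$ , so a single threshold separates them; the paper's FindCore additionally uses the size bounds of the cliques and several weight scales.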

Our local reconstruction algorithm for clusterable graphs is built upon our robust clustering oracle. That is, we first learn the cores of the input graph as before. Then (if the noisy part is not too large) we only “repair” the vertices that are reported as outliers. Let $v$ be any vertex that is reported as an outlier. We add all the neighbors of $v$ in an explicit expander $G_{\exp}$ to “repair” the graph $G$ , which is called a hybridization (between $G_{\exp}$ and $G$ ) and has been used to repair expander graphs in [KPS13]. Then the answers are guaranteed to be consistent with a graph $G^{\prime}$ such that its distance to the original graph $G$ is at most $d$ times the number of vertices that are reported as outliers, which has already been bounded to be small. In order to prove the claimed guarantee on the cluster structure of $G^{\prime}$ , we introduce a definition of weak vertices that intuitively correspond to the noisy part of the graph. Such a definition has also been used in [KPS13], though ours is more subtle, depending on the size of noise. We can show that one can improve the cluster structure of the graph if we have repaired all the weak vertices in the above way. Furthermore, such weak vertices will always be reported as outliers, which is guaranteed by the performance of our robust clustering oracle.
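The hybridization step and the consistency requirement can be illustrated with a small sketch. It is only a toy under stated assumptions: `exp_neighbors` uses a circulant graph as a stand-in for the explicit expander $G_{\exp}$ (a circulant graph with these hypothetical offsets is not claimed to be an expander), and `is_outlier` is a hypothetical deterministic oracle. A stand-in edge is added whenever at least one endpoint is an outlier, which is one way to make the answers symmetric.

```python
def exp_neighbors(v, n, offsets=(1, 2, 3, 5)):
    """Neighbors of v in a stand-in for G_exp: a circulant graph on
    {0, ..., n-1}. The offsets are hypothetical and chosen for the demo."""
    forward = {(v + s) % n for s in offsets}
    backward = {(v - s) % n for s in offsets}
    return sorted((forward | backward) - {v})

def repaired_neighbors(v, adj, n, is_outlier):
    """Neighborhood of v in the repaired graph G': all original edges,
    plus every stand-in edge with at least one outlier endpoint."""
    nbrs = set(adj[v])
    v_is_outlier = is_outlier(v)
    for w in exp_neighbors(v, n):
        if v_is_outlier or is_outlier(w):
            nbrs.add(w)
    return sorted(nbrs)

# Toy input: a 12-cycle in which vertex 4 is reported as an outlier.
n = 12
adj = {v: [(v - 1) % n, (v + 1) % n] for v in range(n)}
outliers = {4}
is_outlier = lambda v: v in outliers

new_adj = {v: repaired_neighbors(v, adj, n, is_outlier) for v in range(n)}
print(new_adj[4], new_adj[6], new_adj[10])
```

Because each answer depends only on $G$ , the stand-in graph, and the (fixed) outlier answers, the same graph $G^{\prime}$ is obtained regardless of the order of queries, and each outlier contributes at most a bounded number of new edges.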

1.4 Relation to Testing Graph Clusterability

Both the above robust clustering oracle and local reconstruction are closely related to the framework of property testing [RS96, GGR98]. In bounded degree graph property testing [GR02], given a property $\Pi$ , the algorithm aims to distinguish graphs that satisfy $\Pi$ from graphs that are $\varepsilon$ -far from satisfying $\Pi$ by making as few queries (to the adjacency list of the graph) as possible, with high constant probability, say at least $2/3$ . Here, a graph is said to be $\varepsilon$ -far from satisfying property $\Pi$ if one has to modify more than $\varepsilon dn$ edges to make it satisfy $\Pi$ , while preserving the degree bound. After two decades of study, a number of properties of bounded degree graphs are now known to be testable in constant time [GR02, BSS10, HKNO09, NS13], or in $\tilde{O}(\sqrt{n})$ or $\tilde{O}(n^{\frac{1}{2}+c})$ time [GR98, GR00, CS10, KS11, NS10, CPS15, CKK+18, KSS18].

In particular, for the property of being $(k,\phi_{\textrm{in}},\phi_{\textrm{out}})$ -clusterable, [CPS15] gave a testing algorithm that runs in time $\tilde{O}(\sqrt{n}\textrm{poly}(\phi,k,1/\varepsilon))$ and distinguishes $(k,\phi,O(\frac{\phi^{2}\varepsilon^{4}}{k^{\Omega(1)}}))$ -clusterable graphs from graphs that are $\varepsilon$ -far from being $(k,\Theta(\frac{\phi^{2}\varepsilon^{4}}{k^{\Omega(1)}\log n}),\psi)$ -clusterable, for any $\psi\in[0,1]$ . (Note that the algorithm rejects any graph that is far from clusterable graphs with arbitrary outer conductance.) [CKK+18] recently improved this result by giving an algorithm that distinguishes graphs that can be partitioned into at most $k$ subsets with inner conductance at least $\phi$ from those that can be decomposed into at least $k+1$ subsets with size at least $\Omega_{k}(n)$ and outer conductance at most $O(\mu\phi^{2})$ in time $O(n^{1/2+O(\mu)})$ for any $\mu$ that is smaller than some constant (they also generalize their algorithm to general graphs). For the case of $k=1$ , i.e., testing if the graph has expansion at least $\phi$ , the best known algorithm can test if a graph has expansion $\phi$ or is $\varepsilon$ -far from having expansion $\Theta(\mu\phi^{2})$ in time $\tilde{O}(n^{0.5+\mu})$ for any $\mu>0$ ([KS11, NS10] which improves upon [CS10]). Furthermore, there exists a lower bound of $\Omega(\sqrt{n})$ on the query complexity for testing expansion [GR02].

Note that both the robust clustering oracle problem and the reconstruction problem are always much harder than the property testing version (see e.g., [KPS13]). For example, in the oracle problem, we need to figure out the cluster structure of the clusterable graph, and in the local reconstruction problem, the algorithm actively repairs the input graph, while property testing is a decision problem. Furthermore, property testing only needs to distinguish between graphs which are clusterable and those that are $\varepsilon$ -far from being clusterable, while both the clustering oracle and the reconstruction have to (in some sense) approximate the distance to the class of all clusterable graphs (footnote 5). Thus, the property testing algorithms cannot be directly used or easily modified to give a robust clustering oracle or local reconstruction algorithm. In particular, even for the case that the input graph is clusterable, one cannot use the corresponding property testing algorithm (on the clusterable graph) to answer SameCluster queries. Actually, both algorithms in [CPS15, CKK+18] make decisions based on some small summarizations of the input graph which are constructed from a small sample of vertices and the corresponding random walk statistics. Such small summarizations can be used to distinguish if the graph is $k$ -clusterable or is far from being $k$ -clusterable. However, if the graph is indeed $k$ -clusterable, they cannot be used to distinguish if two vertices are from the same cluster or are from two different clusters. 
As we mentioned before, in [CKK+18], evidence has been provided that in general it is not possible to use pairwise Euclidean distances between two random walk distributions to distinguish between $2$ -clusterable graphs and far from $2$ -clusterable graphs if the gap between conductances is constant. (Footnote 5: Actually, in our setting, we are approximating the intra-perturbation distance to the class of all clusterable graphs, i.e., the minimum number of intra-cluster edges needed to be modified to obtain a clusterable graph over all possible $h$ -partitions, for some $h\leq k$ . This is in contrast to approximating the distance to all clusterable graphs, which is the minimum number of edges needed to be modified to obtain a clusterable graph.)

On the other hand, property testing algorithms can always be obtained from the corresponding local reconstruction ones (which has already been noted in previous work on local reconstruction) and testing $k$ -clusterability can also be obtained from our robust clustering oracle algorithm. This is also true in our scenario since we can estimate the distance between $G$ and a clusterable graph $G^{\prime}$ with small additive error by sampling a constant number of vertices and running the oracle and clustering query algorithm (or the local reconstruction algorithm) on each sampled vertex to obtain the fraction of outlier vertices. We further note that if a graph $G$ is $\varepsilon$ -far from any $(k,\phi_{\textrm{in}},\phi_{\textrm{out}})$ -clusterable graph, then it cannot be an $\varepsilon$ -perturbation of any such clusterable graph (i.e., one has to perturb more than an $\varepsilon$ -fraction of edges). Therefore, both our robust clustering oracle and local reconstructor algorithm lead to a property testing algorithm that distinguishes $(k,\phi,O(\frac{\varepsilon\phi}{k^{3}\log n}))$ -clusterable graphs from graphs that are $\varepsilon$ -far from being $(k,\Omega(\frac{\nu^{k}}{6^{k}k^{4}}\frac{\varepsilon\phi}{\log n}),k\nu)$ -clusterable for any $\nu\in[0,1]$ , with probability at least $2/3$ . The running time of the algorithm is $\tilde{O}(\sqrt{n})$ , which is optimal up to polylogarithmic factors due to the $\sqrt{n}$ lower bound on the number of queries for testing expansion (corresponding to $k=1$ in our problem) [GR02].
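The reduction from testing to the oracle described above can be sketched as follows. This is only an illustration under assumptions: `is_outlier` is a hypothetical callable standing in for the IsOutlier query of the robust clustering oracle, and the number of samples and the acceptance threshold are placeholder parameters rather than the values needed for the stated guarantee.

```python
import random

def estimate_outlier_fraction(n, is_outlier, num_samples, seed=0):
    """Estimate the fraction of vertices reported as outliers by
    querying the oracle on uniformly sampled vertices."""
    rng = random.Random(seed)
    hits = sum(1 for _ in range(num_samples) if is_outlier(rng.randrange(n)))
    return hits / num_samples

def accept(n, is_outlier, threshold, num_samples=200, seed=0):
    """Accept iff the estimated outlier fraction is at most threshold.
    Both threshold and num_samples are placeholders for this sketch."""
    return estimate_outlier_fraction(n, is_outlier, num_samples, seed) <= threshold

# Two extreme hypothetical oracles on n = 1000 vertices.
n = 1000
no_outliers = lambda v: False
all_outliers = lambda v: True
print(accept(n, no_outliers, threshold=0.1), accept(n, all_outliers, threshold=0.1))
```

By a standard Chernoff bound, a constant number of samples suffices to estimate the outlier fraction up to a small additive error with constant probability, so the cost of such a tester is dominated by the oracle's preprocessing and query time.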

1.5 Other Related Work

The study on local graph clustering [ST13, ACL06, AP09, OT12, AOPT16, ZLM13, OZ14] is also closely related to our work. In this framework, the goal is to find a cluster from a specified vertex with running time that is bounded in terms of the size of the output set (and with a weak dependence on $n$ ). In the scenario where both inner and outer conductance are used for measuring the quality of clusters, [ZLM13] gave a local clustering algorithm that outputs a set with conductance at most $\tilde{O}(\min\{\sqrt{\phi_{G}(A)},\phi_{G}(A)/\sqrt{\textrm{Conn}(A)}\})$ where $A$ is the target set, and $\textrm{Conn}(A)$ is the reciprocal (e.g., $\phi(G[A])^{2}/(\log\textrm{vol}(A))$ ) of the mixing time of the random walk over the induced subgraph $G[A]$ on $A$ and $\textrm{vol}(A)$ is the total degree of vertices in $A$ . It is also shown that the conductance guarantee ${\phi_{G}(A)}/{\sqrt{\textrm{Conn}(A)}}$ is tight among (some class of) random-walk based local algorithms [ZLM13]. It is interesting to note the logarithmic dependency (i.e., the $\log(\textrm{vol}(A))$ factor) that appears in these guarantees. The performance guarantee has later been improved by [OZ14] using a flow-based local improvement algorithm that finds a set with conductance $\psi=O(\phi_{G}(A))$ , volume $O(\textrm{vol}(A))$ and runs in time $\tilde{O}(\textrm{vol}(A)/\psi)$ , where $A$ is the target set with $\textrm{Conn}(A)/\phi_{G}(A)=\Omega(1)$ . Note that the running times of these algorithms are sublinear only if the size (or volume) of the target set is small (say, at most $o(n)$ ), while in our setting, the clusters of interest have at least linear size (for any constant $\varepsilon$ ).

Fully or partially recovering the clusters in the noisy model has been extensively studied in the “global algorithm regimes”. Examples include recovering the planted partition in stochastic block model with modeling errors or noise (e.g., [CL15, GV16, MPW16, MMV16]), correlation clustering on different ground-truth graphs in the semi-random model (e.g., [MS10, CJSX14, GRSY14, MMV15]) and partitioning the graph in the average-case model [MMV12, MMV14, MMV15]. All these algorithms run in at least linear time.

Local reconstruction of some other properties has been investigated before. Such properties include expanders [KPS13], graph connectivity and diameter [CGR13], bipartite and $\rho$ -clique dense graphs [Bra08], geometric properties [CS11], monotone functions [ACCL08, SS10], Lipschitz functions [JR13] and low rank matrices and subspaces [DGK17]. This algorithmic framework is also closely related to locally decodable codes (e.g., [STV99]) and local decompression [DLRR13]. The local reconstruction model has been generalized to the local computation model by Rubinfeld et al. [RTVX11, ARVX12], and a number of problems like maximal independent set, hypergraph coloring and maximum matching have been investigated in this model [RTVX11, ARVX12, MRVX12, MV13].

Organization of the paper.

We give preliminaries in Section 2. In Section 3, we give the algorithm and the analysis for our robust clustering oracle and prove Theorem 1.3. In Section 4, we give our local reconstruction algorithm and its analysis, and prove Theorem 1.4. The proofs of Theorems 1.3 and 1.4 both rely on the local mixing property of random walks in noisy clusterable graphs, i.e., Theorem 1.5, which we prove in Section 5. We conclude in Section 6.

2 Preliminaries

Let $G=(V,E)$ denote an $n$ -vertex undirected graph with maximum degree bounded by some constant $d$ , where $V=[n]:=\{1,\cdots,n\}$ . For each vertex $v$ , we let $d_{v}$ denote its degree. Throughout the paper, all the vectors will be row vectors unless otherwise specified or transposed to column vectors. For a vector $\mathbf{x}$ , we let $\lVert\mathbf{x}\rVert_{1}:=\sum_{i}\lvert\mathbf{x}(i)\rvert$ and $\lVert\mathbf{x}\rVert_{2}:=\sqrt{\sum_{i}\mathbf{x}(i)^{2}}$ denote its $\ell_{1}$ norm and $\ell_{2}$ norm, respectively. Let $\textbf{1}_{S}$ denote the indicator vector of set $S$ , that is $\textbf{1}_{S}(u)=1$ if $u\in S$ and $0$ otherwise. Let $\textbf{1}_{v}:=\textbf{1}_{\{v\}}$ . Let $\mathbf{\mathcal{U}}_{S}:=\frac{\textbf{1}_{S}}{|S|}$ denote the uniform distribution on set $S$ . For any set $X$ of vectors $\mathbf{x}_{1},\cdots,\mathbf{x}_{s}$ , we let $\textrm{span}(X)=\textrm{span}(\mathbf{x}_{1},\cdots,\mathbf{x}_{s})$ denote the linear span of $X$ , that is $\textrm{span}(X)=\{\sum_{i=1}^{s}\mu_{i}\mathbf{x}_{i}|\mu_{i}\in\mathbb{R}\}$ . For a vector $\mathbf{x}$ and a set $S$ , we let $\mathbf{x}(S):=\sum_{v\in S}\mathbf{x}(v)$ . For two distributions $\textbf{p}_{1}$ and $\textbf{p}_{2}$ , we let $\lVert\textbf{p}_{1}-\textbf{p}_{2}\rVert_{\textrm{TV}}$ denote the total variation distance between $\textbf{p}_{1},\textbf{p}_{2}$ . It is known that $\lVert\textbf{p}_{1}-\textbf{p}_{2}\rVert_{\textrm{TV}}=\frac{1}{2}\lVert\textbf{p}_{1}-\textbf{p}_{2}\rVert_{1}$ .

Different types of random walks on $G$ .

We will consider the following random walks.

(1) (Normal) random walk of $t$ steps. In a (normal) random walk, at each step, suppose we are at vertex $v$ ; then we jump to each neighbor of $v$ with probability $\frac{1}{2d}$ and stay at $v$ with the remaining probability $1-\frac{d_{v}}{2d}$ . We stop the walk after $t$ steps. We let $\textbf{p}_{v}^{t}$ denote the probability vector for a $t$ step random walk starting at $v$ .

(2) Uniform averaging walk of $t$ steps. In this walk, we choose a number $\ell\in\{0,1,2,\cdots,t-1\}$ uniformly at random, and stop the (normal) random walk after $\ell$ steps. We let ${\textbf{a}}_{v}^{t}$ denote the probability vector for a uniform averaging walk of $t$ steps starting at $v$ .

(3) Uniform averaging walk of $t$ steps with two phases. In this walk, we choose two integers $\ell_{1},\ell_{2}\in\{0,1,2,\cdots,t-1\}$ uniformly at random, and stop the walk after $\ell_{1}+\ell_{2}$ steps. We let ${\textbf{b}}_{v}^{t}$ denote the probability vector for a uniform averaging walk of $t$ steps with two phases starting at $v$ .

It is useful to note that for any two vertices $u,v$ , ${\textbf{b}}_{u}^{t}(v)=\sum_{w\in V}{\textbf{a}}_{u}^{t}(w)\cdot{\textbf{a}}_{w}^{t}(v).$
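The three walks and the identity above can be checked numerically. The following is a minimal sketch on a toy path graph; the helper names (`lazy_walk_matrix`, `averaging_matrix`, `two_phase_matrix`) and the toy input are assumptions made for illustration and do not appear in the paper.

```python
import numpy as np

def lazy_walk_matrix(adj, d):
    """Transition matrix of the (normal) random walk on a d-bounded graph:
    move to each neighbor with probability 1/(2d), stay otherwise."""
    n = len(adj)
    P = np.zeros((n, n))
    for v, nbrs in adj.items():
        for u in nbrs:
            P[v, u] += 1.0 / (2 * d)
        P[v, v] += 1.0 - len(nbrs) / (2 * d)
    return P

def averaging_matrix(P, t):
    """Row v is a_v^t, the average of p_v^0, ..., p_v^{t-1}."""
    n = P.shape[0]
    total, power = np.zeros((n, n)), np.eye(n)
    for _ in range(t):
        total += power
        power = power @ P
    return total / t

def two_phase_matrix(P, t):
    """Row v is b_v^t, computed directly from the definition: average of
    p_v^{l1 + l2} over all pairs (l1, l2) in {0, ..., t-1}^2."""
    powers = [np.linalg.matrix_power(P, s) for s in range(2 * t - 1)]
    total = sum(powers[l1 + l2] for l1 in range(t) for l2 in range(t))
    return total / t**2

# Toy input: the path 0-1-2-3 with degree bound d = 2.
adj = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
P = lazy_walk_matrix(adj, d=2)
A = averaging_matrix(P, t=5)
B = two_phase_matrix(P, t=5)

# b_u^t(v) = sum_w a_u^t(w) * a_w^t(v) is the matrix identity B = A A.
print(np.allclose(B, A @ A))
```

In matrix form the identity says that the two-phase walk is the composition of two independent uniform averaging walks, which is why its distribution matrix is the square of the averaging matrix.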

A simple reduction: from $d$ -bounded graphs to $d$ -regular graphs.

Given a graph $G$ with maximum degree upper bounded by $d$ , it will be very convenient to consider the $d$ -regular graph $G^{\prime}$ that is obtained by adding an appropriate number of self-loops (each with half weight) to each vertex so that every vertex has degree exactly $d$ . Note that the (normal) random walk on $G$ we defined above is exactly the lazy random walk of the graph $G^{\prime}$ . Let $\mathbf{A}$ denote the adjacency matrix of $G^{\prime}$ , and let $\mathbf{L}:=\mathbf{I}-\frac{1}{d}\mathbf{A}$ denote the normalized Laplacian matrix of $G^{\prime}$ . We let $0=\lambda_{1}\leq\lambda_{2}\leq\cdots\leq\lambda_{n}\leq 2$ denote the eigenvalues of $\mathbf{L}$ and let $\mathbf{v}_{1},\mathbf{v}_{2},\cdots,\mathbf{v}_{n}$ denote the corresponding orthonormal (row) eigenvectors. That is, $\mathbf{L}=\sum_{i}\lambda_{i}\cdot\mathbf{v}_{i}^{T}\cdot\mathbf{v}_{i}$ . Note that the lazy random walk matrix corresponding to $G^{\prime}$ is $\mathbf{P}:=\frac{\mathbf{I}+\frac{1}{d}\mathbf{A}}{2}=\mathbf{I}-\frac{\mathbf{L}}{2}$ . This implies that the eigenvalues of $\mathbf{P}$ are $1=1-\frac{\lambda_{1}}{2},1-\frac{\lambda_{2}}{2},\cdots,1-\frac{\lambda_{n}}{2}\geq 0$ , with corresponding eigenvectors $\mathbf{v}_{1},\mathbf{v}_{2},\cdots,\mathbf{v}_{n}$ . In particular, $\mathbf{P}=\sum_{i}(1-\frac{\lambda_{i}}{2})\cdot\mathbf{v}_{i}^{T}\cdot\mathbf{v}_{i}$ . Furthermore, it holds that $\textbf{p}_{v}^{t}=\textbf{1}_{v}\cdot\mathbf{P}^{t}=\sum_{i}(1-\frac{\lambda_{i}}{2})^{t}\cdot\textbf{1}_{v}\cdot\mathbf{v}_{i}^{T}\cdot\mathbf{v}_{i}$ .
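The reduction and the spectral expression for $\textbf{p}_{v}^{t}$ can be verified numerically. The sketch below uses a small toy graph with $d=3$ ; the function name `regularized_adjacency` and the toy input are assumptions for illustration. A half-weight self-loop is modeled as contributing $1$ to the diagonal of $\mathbf{A}$ , so that every row of $\mathbf{A}$ sums to $d$ .

```python
import numpy as np

def regularized_adjacency(adj, d):
    """Adjacency matrix of G': the graph G plus d - d_v half-weight
    self-loops at each vertex v, so that every row sums to d."""
    n = len(adj)
    A = np.zeros((n, n))
    for v, nbrs in adj.items():
        for u in nbrs:
            A[v, u] += 1.0
        A[v, v] += d - len(nbrs)
    return A

# Toy input: a triangle 0-1-2 with a pendant vertex 3 attached to 2.
adj = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2]}
d, n = 3, 4
A = regularized_adjacency(adj, d)
L = np.eye(n) - A / d                 # normalized Laplacian of G'
P = (np.eye(n) + A / d) / 2           # lazy random walk matrix of G'

lam, V = np.linalg.eigh(L)            # columns of V: orthonormal eigenvectors

t, v = 7, 0
p_direct = np.linalg.matrix_power(P, t)[v]
# Spectral form: sum_i (1 - lam_i / 2)^t * v_i(v) * v_i
p_spectral = sum((1 - lam[i] / 2) ** t * V[v, i] * V[:, i] for i in range(n))
print(np.allclose(p_direct, p_spectral))
```

Note that $\textbf{1}_{v}\cdot\mathbf{v}_{i}^{T}$ is simply the scalar $\mathbf{v}_{i}(v)$ , which is what the code multiplies each eigenvector by.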

Estimating reduced collision probabilities.

Both our robust clustering oracle and our local reconstruction algorithm need to invoke a procedure to estimate the reduced collision probability of two random walks [KPS13]. For a vertex $v$ , an integer $t$ and a constant $\theta\in[0,1]$ , we let $S_{v}^{\theta}=\{u:{\textbf{a}}_{v}^{t}(u)\leq\frac{1-\theta}{\sqrt{n}}\}$ . For any two vertices $u,v$ , the $\theta$ -reduced collision probability of $u,v$ is defined as

[TABLE]

Observe that by definition of ${\textbf{b}}_{v}^{t}$ -random walks, it holds that

[TABLE]

The following lemma shows that under appropriate conditions, the reduced collision probability of two vertices can be well approximated in $\tilde{O}(\sqrt{n})$ time.

Lemma 2.1 ([KPS13]).

Let $\theta<\frac{1}{2},\delta<1$ be two constants. Let $u,v$ be two vertices. There exists a procedure EstimateRCP( $G,u,v,\theta,\delta,t$ ) that takes as input a $d$ -bounded $n$ -vertex graph $G$ , vertices $u,v$ , parameters $\theta,\delta$ , and length parameter $t$ , and satisfies the following properties:

1) It runs in time $O(\sqrt{n}t\log^{2}n)$ ;

2) If ${\textbf{a}}_{u}^{t}(S_{u}^{\theta})\geq 1/2,{\textbf{a}}_{v}^{t}(S_{v}^{\theta})\geq 1/2$ , then it aborts (without outputting an estimate) with probability at most $\exp(-\Theta(\sqrt{n}))$ ;

3) If it does not abort, then with probability at least $1-\frac{1}{n^{4}}$ , it outputs an estimate $\mathrm{rcp}^{\prime}(u,v)$ such that

[TABLE]

For the sake of completeness, we give the description of the algorithm EstimateRCP in Appendix B.
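To give a feel for how collision statistics of sampled walks estimate such quantities, here is a minimal Monte Carlo sketch. It is not the EstimateRCP procedure of Appendix B: it estimates the plain collision probability $\sum_{w}{\textbf{a}}_{u}^{t}(w){\textbf{a}}_{v}^{t}(w)$ , without the restriction to the sets $S_{u}^{\theta},S_{v}^{\theta}$ and without the abort rule. The function names and parameters are assumptions made for illustration.

```python
import random
from collections import Counter

def sample_endpoint(adj, d, start, t, rng):
    """End vertex of one uniform averaging walk of t steps from start."""
    v = start
    for _ in range(rng.randrange(t)):      # ell uniform in {0, ..., t-1}
        k = rng.randrange(2 * d)           # each of 2d slots has prob. 1/(2d)
        if k < len(adj[v]):                # first d_v slots: move to a neighbor
            v = adj[v][k]
        # remaining slots: stay at v
    return v

def estimate_collision(adj, d, u, v, t, num_walks, rng):
    """Fraction of (walk from u, walk from v) pairs that end at the same
    vertex; an unbiased estimate of sum_w a_u^t(w) * a_v^t(w)."""
    ends_u = Counter(sample_endpoint(adj, d, u, t, rng) for _ in range(num_walks))
    ends_v = Counter(sample_endpoint(adj, d, v, t, rng) for _ in range(num_walks))
    collisions = sum(c * ends_v[w] for w, c in ends_u.items())
    return collisions / num_walks**2

rng = random.Random(0)
# Degenerate check: on isolated vertices the walks never move.
isolated = {0: [], 1: []}
same = estimate_collision(isolated, 1, 0, 0, t=5, num_walks=10, rng=rng)
diff = estimate_collision(isolated, 1, 0, 1, t=5, num_walks=10, rng=rng)

# A triangle: the estimate is some value in [0, 1].
triangle = {0: [1, 2], 1: [0, 2], 2: [0, 1]}
est = estimate_collision(triangle, 2, 0, 1, t=10, num_walks=200, rng=rng)
print(same, diff, est)
```

Counting pairwise collisions among roughly $\sqrt{n}$ walks from each vertex is what makes probabilities of order $1/n$ detectable in $\tilde{O}(\sqrt{n})$ time, in the spirit of the birthday paradox.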

3 Robust Clustering Oracle

In this section, we present our algorithm for constructing the robust clustering oracle and answering the clustering queries. In the preprocessing (or learning) phase, the algorithm learns the cores (corresponding to clusters in the clusterable part) of the graph. In the query phase, the algorithm checks if the queried vertex $v$ belongs to any of the learned cores or not to decide if it is an outlier or not. If not, the algorithm will find the index $i$ corresponding to the cluster that $v$ belongs to.

We will use the reduced collision probability of random walks of length $t=\frac{960\log n}{\kappa\phi^{2}}$ for some sufficiently small constant $\kappa>0$ . Such probabilities can be efficiently estimated by invoking the EstimateRCP procedure (see Section 2). The intuition is that for a typical vertex $u$ in a large cluster $C$ , the uniform averaging walk of $t$ steps from $u$ will be close to the uniform distribution on $C$ (by Theorem 1.5), which implies that for almost all vertices $v\in C$ , the reduced collision probability of $u$ and $v$ is at least $\frac{1-\kappa}{|C|}$ .

The learning phase of the algorithm is as follows.

[TABLE]

The subroutine FindCore( $H,J$ ) is defined as follows.

[TABLE]

Note that by the above definition of cores, it holds that for any core $S_{i}$ , there exists $j_{i}\in\{0,1,\cdots,J\}$ such that $\frac{|S_{i}|}{|S|}\geq(1-\kappa)\tau_{j_{i}}$ and the edge weight in the clique $H[S_{i}]$ is at least $\frac{1-\kappa}{\tau_{j_{i}}}\frac{1}{n}$ .

We need the following subroutine to answer clustering queries.

[TABLE]

Now we are ready to describe our algorithm for answering clustering queries.

[TABLE]

3.1 The Analysis of Robust Clustering Oracle

In the following, we show the performance guarantee of the above algorithm. We will use the local mixing property on noisy clusterable graphs as guaranteed in Theorem 1.5, whose proof is deferred to Section 5. Recall from the description of our algorithm that $\kappa=100\cdot\delta_{0}^{2}$ , which is a sufficiently small universal constant.

If $\varepsilon>\frac{\phi\kappa^{2}}{100}$ (i.e., the noise is too much), then by our algorithm, the learning phase will output fail. Any queried vertex will be reported as Outlier.

In the following, we assume that $\varepsilon\in[\Omega(\frac{\phi}{{n}}),\frac{\phi\kappa^{2}}{100}]$ and we prove the statement of Theorem 1.3. To do so, we first introduce the definition of strong vertices, which correspond to vertices in the clusterable part.

Definition and properties of strong vertices.

Let $\phi\in(0,1),\varepsilon\in[\Omega(\frac{\phi}{{n}}),\frac{\phi\kappa^{2}}{100}]$ . Let $G$ be an $\varepsilon$ -perturbation of a $(k,\phi,\frac{a_{\ref{thm:rw_perturbed}}\varepsilon\kappa^{4}\phi}{3k^{3}\log n})$ -clusterable graph. Recall that ${\textbf{a}}_{v}^{t}$ and ${\textbf{b}}_{v}^{t}$ denote the distribution of the uniform average walk of length $t$ and the uniform average walk of length $t$ with two phases starting from $v$ , respectively. In the algorithm, we invoke EstimateRCP with length parameter $t=\frac{960\log n}{\kappa\phi^{2}}$ .

We let $\varepsilon^{\prime}:=\frac{6\varepsilon}{\phi}<\frac{\kappa^{2}}{100}$ . We introduce the following definition of strong vertex for the analysis, which was inspired by the corresponding definition for noisy expander graphs in [KPS13]. The main difference here is that we carefully take the size of clusters into consideration.

Definition 3.1.

We call a vertex $v$ a strong vertex with respect to a subset $C$ if $v\in C$ , $|C|\geq 3\sqrt{\varepsilon^{\prime}}n$ and $\lVert{\textbf{a}}_{v}^{t}-\mathbf{\mathcal{U}}_{C}\rVert_{\textrm{TV}}\leq\kappa.$

Recall that $\theta_{0}$ is a sufficiently small constant, $S_{v}^{\theta_{0}}=\{u:{\textbf{a}}_{v}^{t}(u)\leq(1-{\theta_{0}})/\sqrt{n}\}$ and that $\mathrm{rcp}_{\theta_{0}}(u,v)=\sum_{w\in S_{u}^{\theta_{0}}\cap S_{v}^{\theta_{0}}}{\textbf{a}}_{u}^{t}(w){\textbf{a}}_{v}^{t}(w)$ is the reduced collision probability of $u,v$ (see Section 2). We have the following properties of strong/weak vertices, which easily follow from the proof of Lemma 2 in [KPS13]. We present the proof in Appendix A for the sake of completeness.

Lemma 3.2.

If a vertex $u$ is strong with respect to a set $C$ with $|C|\geq 3\sqrt{\varepsilon^{\prime}}n$ , then (1) there can be at most $\sqrt{\kappa}|C|$ vertices $v$ in $C$ with ${\textbf{a}}_{u}^{t}(v)\leq(1-\sqrt{\kappa})/|C|$ ; (2) it holds that ${\textbf{a}}_{u}^{t}(S_{u}^{\theta_{0}})\geq 1/2$ .

Furthermore, if vertices $u,v$ are both strong with respect to a set $C$ with $|C|\geq 3\sqrt{\varepsilon^{\prime}}n$ , then we have that $\mathrm{rcp}_{\theta_{0}}(u,v)\geq(1-5\sqrt{\kappa})/|C|$ .

The correctness of the robust clustering oracle.

Now we show the correctness of the robust clustering oracle and bound the total number of vertices reported as outliers by the algorithm. Recall that we let $P_{i}:=\{u\in V:\textsc{WhichCluster}(u)=i\}$ with $1\leq i\leq h$ for some integer $h$ , and $B:=\{u\in V:\textsc{IsOutlier}(u)=\textbf{Yes}\}$ denote the partition output by our algorithm.

Lemma 3.3.

Let $G$ be an $\varepsilon$ -perturbation of a $(k,\phi,\frac{a_{\ref{thm:rw_perturbed}}\varepsilon\kappa^{4}\phi}{3k^{3}\log n})$ -clusterable graph. Then there exists a partition ${D}_{1},\cdots,{D}_{h^{\prime}},B^{\prime}$ for some $h^{\prime}\leq k$ (that is independent of the order of queries), such that

•

if $\varepsilon\in[\Omega(\frac{\phi}{{n}}),\frac{\phi}{60k^{2}}]$ , then $h^{\prime}\geq 1$ and each $D_{i}$ is a $(\frac{\phi}{2},\frac{a_{\ref{thm:rw_perturbed}}\sqrt{\varepsilon}\kappa^{4}\phi^{1.5}}{3k^{3}\log n})$ -cluster, for any $1\leq i\leq h^{\prime}$ ; if $\varepsilon\in(\frac{\phi}{60k^{2}},1]$ , then $h^{\prime}=0$ ; and

•

with probability at least $1-\frac{1}{n}$ , the partition $P_{1},\cdots,P_{h},B$ output by the algorithm satisfies that $h^{\prime}\leq h\leq k$ and $\sum_{i=1}^{h^{\prime}}|P_{i}\triangle D_{i}|+\sum_{i=h^{\prime}+1}^{h}|P_{i}|+|B|+|B^{\prime}|\leq 40k\sqrt{\varepsilon/\phi}n$ .

In particular, the number of vertices reported as outliers is at most $40k\sqrt{\varepsilon/\phi}n$ .

Proof.

We first note that if $\varepsilon>\frac{\phi}{60k^{2}}$ , then we can simply take $B^{\prime}=V$ (and thus $h^{\prime}=0$ ) and then for any output partition of the algorithm, it holds that $\sum_{i=1}^{h^{\prime}}|P_{i}\triangle D_{i}|+\sum_{i=h^{\prime}+1}^{h}|P_{i}|+|B|+|B^{\prime}|=2n<40k\sqrt{\varepsilon/\phi}n$ .

Thus, in the following, we assume that $\varepsilon\leq\frac{\phi}{60k^{2}}$ .

Let $\phi_{\textrm{out}}=\frac{a_{\ref{thm:rw_perturbed}}\varepsilon\kappa^{4}\phi}{3k^{3}\log n}$ . Let $G^{*}=(V,E^{*})$ be a $(k,\phi,\phi_{\textrm{out}})$ -clusterable graph such that $G$ is an $\varepsilon$ -perturbation of $G^{*}$ . Let $C_{1},\cdots,C_{\overline{h}}$ be the corresponding $(\overline{h},\phi,\phi_{\textrm{out}})$ -clustering of $G^{*}$ for some $\overline{h}\leq k$ . That is, for each $i\leq\overline{h}$ , $\phi_{G}(C_{i})\leq\phi_{\textrm{out}}$ , and one can insert/delete at most $\varepsilon dn$ edges inside subgraphs $G[C_{i}]$ to make all $G[C_{i}]$ become $(\phi,\phi_{\textrm{out}})$ -clusters.

Now for each set $C_{i}$ , we perform the following process on $G[C_{i}]$ recursively. We start with $B_{i}:=\emptyset$ and $D_{i}:=C_{i}$ . If $|B_{i}|\leq\frac{|C_{i}|}{2}$ , and there exists a subset $M_{i}\subseteq D_{i}$ with $|M_{i}|\leq|D_{i}|/2$ and $\phi_{G[C_{i}]}(M_{i})\leq\phi/2$ , then we update $B_{i}=B_{i}\cup M_{i}$ , and $D_{i}=D_{i}\setminus M_{i}$ . We recurse until no such set $M_{i}$ can be found or $|B_{i}|>\frac{|C_{i}|}{2}$ . Note that by our construction, the final set $B_{i}$ satisfies that $\phi_{G[C_{i}]}(B_{i})\leq\phi/2$ and that $D_{i}$ has inner conductance at least $\phi/2$ . Furthermore, it holds that $|B_{i}|\leq\frac{3}{4}|C_{i}|$ , since right before the last update, the current set $B_{i}^{\prime}$ satisfies $|B_{i}^{\prime}|\leq\frac{|C_{i}|}{2}$ and the final cut $M_{i}^{\prime}$ satisfies $|M_{i}^{\prime}|\leq\frac{1}{2}(|C_{i}-B_{i}^{\prime}|)$ , which gives that $|B_{i}|\leq\frac{1}{2}(|C_{i}-B_{i}^{\prime}|)+|B_{i}^{\prime}|\leq\frac{3}{4}|C_{i}|$ .
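The size bound can be spelled out in one line. Writing $B_{i}^{\prime}$ for the set right before the last update and $M_{i}^{\prime}$ for the last removed cut, so that $B_{i}=B_{i}^{\prime}\cup M_{i}^{\prime}$ with $M_{i}^{\prime}\subseteq C_{i}\setminus B_{i}^{\prime}$ and $|B_{i}^{\prime}|\leq\frac{|C_{i}|}{2}$ , we have:

```latex
\begin{aligned}
|B_i| &= |B_i'| + |M_i'|
       \le |B_i'| + \tfrac{1}{2}\bigl(|C_i| - |B_i'|\bigr) \\
      &= \tfrac{1}{2}|C_i| + \tfrac{1}{2}|B_i'|
       \le \tfrac{1}{2}|C_i| + \tfrac{1}{4}|C_i|
       = \tfrac{3}{4}|C_i|.
\end{aligned}
```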

Now we claim that $|\cup_{i}B_{i}|\leq\frac{6\varepsilon}{\phi}n$ . Assume on the contrary that $|\cup_{i}B_{i}|>\frac{6\varepsilon}{\phi}n$ , i.e., $\sum_{i}|B_{i}|>\frac{6\varepsilon}{\phi}n$ . First, we note that in order to make $\phi(G[C_{i}])\geq\phi$ , we have to add at least $\frac{\phi}{2}d\min\{|B_{i}|,|C_{i}-B_{i}|\}\geq\frac{\phi}{2}d\cdot\frac{1}{3}|B_{i}|=\frac{\phi}{6}d|B_{i}|$ edges, where the inequality follows from the fact that $|C_{i}-B_{i}|\geq\frac{1}{3}|B_{i}|$ , which in turn is due to the fact that $|B_{i}|\leq\frac{3}{4}|C_{i}|$ . Therefore, in order to make all $C_{i}$ have inner conductance at least $\phi$ , we have to add at least $\sum_{i}\frac{\phi}{6}d|B_{i}|>\frac{\phi}{6}d\cdot\frac{6\varepsilon}{\phi}n=\varepsilon dn$ edges, which is a contradiction.

We note that since $\varepsilon\leq\frac{\phi}{60k^{2}}$ , then it holds that at least one $D_{i}$ has size at least $\frac{(1-(6\varepsilon/\phi))n}{k}\geq\frac{9n}{10k}\geq 3\sqrt{\frac{1}{10k^{2}}}n\geq 3\sqrt{\frac{6\varepsilon}{\phi}}n=3\sqrt{\varepsilon^{\prime}}n$ . Now we apply Theorem 1.5 on $G$ with error parameter $\varepsilon^{\prime}=\frac{6\varepsilon}{\phi}<\frac{\kappa^{2}}{100}$ , $\gamma=\frac{\kappa}{2}$ , sets $C_{i}=D_{i}\cup B_{i}$ , $1\leq i\leq\overline{h}$ such that $\phi(G[D_{i}])\geq\frac{\phi}{2}$ , to obtain that for each $D_{i}$ with $|D_{i}|\geq 3\sqrt{\varepsilon^{\prime}}n$ , there exists a subset $\widehat{D}_{i}\subseteq D_{i}$ such that $|\widehat{D}_{i}|\geq(1-4\sqrt{\varepsilon^{\prime}})|D_{i}|$ and for any $v\in\widehat{D}_{i}$ , and $t=\frac{960\log n}{\kappa\phi^{2}}$ ,

[TABLE]

This further implies that all vertices in $\widehat{D}_{i}$ are strong with respect to $C_{i}$ , as $|C_{i}|\geq|D_{i}|\geq 3\sqrt{\varepsilon^{\prime}}n$ . We also note that for each $D_{i}$ with $|D_{i}|\geq 3\sqrt{\varepsilon^{\prime}}n$ , it holds that $\phi_{G}(D_{i})\leq\frac{\phi_{\textrm{out}}\cdot dn}{3\sqrt{\varepsilon^{\prime}}\cdot dn}\leq\frac{a_{\ref{thm:rw_perturbed}}\sqrt{\varepsilon}\kappa^{4}\phi^{1.5}}{3k^{3}\log n}$ . Now we order $D_{i}$ such that $|D_{1}|\geq\cdots\geq|D_{\overline{h}}|$ (breaking ties arbitrarily). Let $h^{\prime}$ be the largest index with $|D_{h^{\prime}}|\geq 3\sqrt{\varepsilon^{\prime}}n$ . Note that $h^{\prime}\geq 1$ . We define the partition $D_{1},\cdots,D_{h^{\prime}},B^{\prime}:=V\setminus(\cup_{i\leq h^{\prime}}D_{i})$ . By definition, it holds that for each $1\leq i\leq h^{\prime}$ , $|D_{i}|\geq 3\sqrt{\varepsilon^{\prime}}n$ and $\phi(G[D_{i}])\geq\frac{\phi}{2},\phi_{G}(D_{i})\leq\frac{\phi_{\textrm{out}}\cdot dn}{3\sqrt{\varepsilon^{\prime}}\cdot dn}\leq\frac{a_{\ref{thm:rw_perturbed}}\sqrt{\varepsilon}\kappa^{4}\phi^{1.5}}{3k^{3}\log n}$ . Note that the partition $D_{1},\cdots,D_{h^{\prime}},B^{\prime}$ only depends on $G$ . It holds that $|B^{\prime}|=|\cup_{i}B_{i}|+|\cup_{i:|D_{i}|<3\sqrt{\varepsilon^{\prime}}n}D_{i}|\leq(\varepsilon^{\prime}+3k\sqrt{\varepsilon^{\prime}})n$ .

We further define $D_{g}:=\cup_{1\leq i\leq h^{\prime}}\widehat{D}_{i}$ .

Now we show the following claim.

Claim 3.4.

With probability at least $1-\frac{1}{n}$ , for all vertices $v$ in $D_{g}$ , WhichCluster( $v$ ) will output a unique index $\sigma(i)$ if vertex $v\in\widehat{D}_{i}$ for some injection $\sigma:[h^{\prime}]\rightarrow[k]$ .

Note that the statement of the lemma will then follow from the above claim: Let $h\leq k$ be the largest index output by the algorithm, and let $B,P_{\sigma(1)},\cdots,P_{\sigma(h^{\prime})},P_{j}$ , for $j\in[h]\setminus\{\sigma(1),\cdots,\sigma(h^{\prime})\}$ be the partition output by the algorithm. Then by Claim 3.4, all vertices in $D_{g}$ will be correctly partitioned and

[TABLE]

Re-arranging the order of sets $D_{1},\cdots,D_{i}$ will complete the proof of the Lemma. Now we prove the claim.

Proof of Claim 3.4.

By the previous analysis, we have that for each $i$ such that $|D_{i}|\geq 3\sqrt{\varepsilon^{\prime}}n$ , the number of vertices in $C_{i}$ that are not strong (with respect to $C_{i}$ ) is at most

[TABLE]

That is, for each $i$ such that $|C_{i}|\geq|D_{i}|\geq 3\sqrt{\varepsilon^{\prime}}n$ , at least $(1-\frac{\kappa}{2})$ fraction of vertices in $C_{i}$ are strong (with respect to $C_{i}$ ).

Now let us consider the sample set $S$ . Recall that $|S|=\frac{c\cdot\log n}{\sqrt{\varepsilon/\phi}}=\Omega(\frac{\log n}{\sqrt{\varepsilon^{\prime}}})$ for some large constant $c>0$ . Let $T_{i}=S\cap C_{i}$ and let $S_{i}^{\prime}\subset T_{i}$ denote the set of vertices in $T_{i}$ that are strong with respect to $C_{i}$ . By Chernoff bound, we have that with probability at least $1-1/n^{4}$ , for any $i$ such that $|C_{i}|\geq|D_{i}|\geq 3\sqrt{\varepsilon^{\prime}}n$ ,

[TABLE]

In the following, we condition on the event that the above two inequalities hold.

Now recall that $\tau_{j}=3\sqrt{\varepsilon^{\prime}}(1+\frac{\kappa}{2})^{j}$ , for $0\leq j\leq J$ , where $J=O(\frac{\log(\phi/\varepsilon)}{\kappa})$ is the maximum integer $j$ such that $\tau_{j}\leq 1$ . Let $j_{i}$ denote the index such that $|C_{i}|\in[\tau_{j_{i}}n,\tau_{j_{i}+1}n)$ . Thus, $|{S}_{i}^{\prime}|\geq(1-\kappa)\tau_{j_{i}}|S|$ .
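The geometric grid of size thresholds can be tabulated directly from this definition. The sketch below is only an illustration: the helper names and the parameter values $\varepsilon^{\prime}=10^{-4}$ , $\kappa=0.1$ are assumptions chosen for readability, not values prescribed by the analysis.

```python
import math


def size_thresholds(eps_prime, kappa):
    """Return [tau_0, ..., tau_J] with tau_j = 3*sqrt(eps') * (1 + kappa/2)**j <= 1."""
    taus = []
    j = 0
    while True:
        tau = 3 * math.sqrt(eps_prime) * (1 + kappa / 2) ** j
        if tau > 1:
            break
        taus.append(tau)
        j += 1
    return taus


def size_index(taus, fraction):
    """Largest j with tau_j <= fraction, so that fraction * n is at least tau_j * n."""
    if not taus or fraction < taus[0]:
        return None  # below 3*sqrt(eps'): too small for the analysis to handle
    return max(j for j, tau in enumerate(taus) if tau <= fraction)


taus = size_thresholds(eps_prime=1e-4, kappa=0.1)   # illustrative parameter values
j_i = size_index(taus, fraction=0.25)               # a cluster holding a quarter of V
```

Since consecutive thresholds differ by a factor of $1+\frac{\kappa}{2}$ , rounding $|C_{i}|/n$ down to $\tau_{j_{i}}$ loses at most that factor, and the grid has only $J+1=O(\frac{\log(\phi/\varepsilon)}{\kappa})$ entries.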

Let $v,u$ be two vertices in ${S}_{i}^{\prime}$ . By Lemma 3.2, we have that ${\textbf{a}}_{u}^{t}(S_{u}^{\theta_{0}})\geq 1/2,{\textbf{a}}_{v}^{t}(S_{v}^{\theta_{0}})\geq 1/2$ , and $\mathrm{rcp}_{\theta_{0}}(u,v)\geq(1-5\sqrt{\kappa})/|C_{i}|$ . By the assumption that $\kappa=100\cdot\delta_{0}^{2}$ and Lemma 2.1, we obtain that with probability at least $1-\frac{1}{n^{4}}-\exp(-\Theta(\sqrt{n}))$ , EstimateRCP $(G,u,v,\theta_{0},\delta_{0},t)$ will output a value $\mathrm{rcp}^{\prime}(u,v)$ that is at least $(1-5\sqrt{\kappa})(1-\frac{\sqrt{\kappa}}{10})/|C_{i}|>\frac{1-\kappa/2}{|C_{i}|}>\frac{1-\kappa/2}{\tau_{j_{i}+1}n}\geq\frac{1-\kappa}{\tau_{j_{i}}n}$ . That is, with probability at least $1-\frac{1}{n^{3}}$ , in the similarity graph $H$ , the induced subgraph $H[{S}_{i}^{\prime}]$ will form a complete graph with at least $(1-\kappa)\tau_{j_{i}}|S|$ vertices such that for each pair $u,v\in{S}_{i}^{\prime}$ , $\mathrm{rcp}^{\prime}(u,v)\geq\frac{1-\kappa}{\tau_{j_{i}}n}$ . Therefore, in our sample, the set $S_{i}^{\prime}$ will be recognized as a subgraph of a core (corresponding to $C_{i}$ ), which is a maximal clique with edge weight at least $\frac{1-\kappa}{\tau_{j_{i}}n}$ .
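The last inequality in the chain above, which passes from $\tau_{j_{i}+1}$ to $\tau_{j_{i}}$ , uses only the geometric spacing $\tau_{j+1}=(1+\frac{\kappa}{2})\tau_{j}$ and can be checked directly:

```latex
\frac{1-\kappa/2}{\tau_{j_i+1}\,n}
  = \frac{1-\kappa/2}{(1+\kappa/2)\,\tau_{j_i}\,n}
  \ge \frac{1-\kappa}{\tau_{j_i}\,n},
\quad\text{since}\quad
(1-\kappa)\Bigl(1+\frac{\kappa}{2}\Bigr)
  = 1-\frac{\kappa}{2}-\frac{\kappa^{2}}{2}
  \le 1-\frac{\kappa}{2}.
```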

Now once a vertex $v\in\widehat{D}_{i}$ is queried (to check whether it is an outlier), then by an argument similar to the above, we can guarantee that with probability at least $1-\frac{1}{n^{3}}$ , for all $u\in S_{i}^{\prime}$ , EstimateRCP will output $\mathrm{rcp}^{\prime}(u,v)$ satisfying $\mathrm{rcp}^{\prime}(u,v)\geq\frac{1-\kappa}{\tau_{j_{i}}n}$ . Thus, the algorithm will detect the core (corresponding to $C_{i}$ ) for $v$ . Furthermore, for any vertex $v$ that is strong with respect to $C_{i}$ , it holds that for any $C_{j}$ with $j\neq i$ , there can be at most $\kappa|C_{j}|$ vertices $u\in C_{j}$ with $\mathrm{rcp}_{\theta_{0}}(u,v)>\frac{1-5\sqrt{\kappa}}{|C_{j}|}$ . This is true since the total probability mass on $C_{j}$ of the random walk distribution from $v$ is at most $\kappa$ . This ensures that there will be a unique core corresponding to $v$ . Let $\sigma:[h^{\prime}]\rightarrow[h^{\prime}]$ denote the corresponding bijection between $\{C_{1},\cdots,C_{h^{\prime}}\}$ and the cores $\{S_{1},\cdots,S_{h^{\prime}}\}$ found by the algorithm. By the union bound, we have that with probability at least $1-\frac{1}{n}$ , for each strong vertex $v\in S_{i}^{\prime}$ , the algorithm will answer the corresponding index $\sigma(i)$ to the query WhichCluster( $v$ ).

Running time and query complexity.

Note that in the learning phase, we need to invoke the procedure EstimateRCP for $|S|\times|S|=O(\frac{k^{4}\ln^{2}k\cdot\phi\log^{2}n}{\varepsilon})$ times, and each invocation takes time $O(\sqrt{n}t\log^{2}n)$ , which in total takes time $O(\sqrt{n}\frac{\log n}{\phi^{2}}\cdot\frac{k^{4}\ln^{2}k\cdot\phi\log^{2}n}{\varepsilon})=O(\sqrt{n}\frac{k^{4}\ln^{2}k\cdot\log^{3}n}{\phi\varepsilon})$ . Finding the cores in the similarity graph can be done by a simple greedy algorithm, which runs in $O(\textrm{poly}(|S|))=O(\textrm{poly}(\frac{k\cdot\phi\log n}{\varepsilon}))$ time. Thus, the query complexity and running time in the learning phase are dominated by $O(\sqrt{n}\cdot\textrm{poly}(\frac{k\cdot\log n}{\varepsilon\phi}))$ , which, by similar arguments, also upper bounds the query complexity and running time on each query vertex $w$ in the query phase.

Remark.

From Lemma 3.3 and its proof, we note that in order to guarantee that $h^{\prime}\geq 1$ , i.e., there exists at least one good cluster $D_{i}$ , we need $\varepsilon=O(\frac{\phi}{k^{2}})$ (so that $(1-\varepsilon^{\prime})n/k\geq 3\sqrt{\varepsilon^{\prime}}n$ ). Thus our algorithm has a non-trivial guarantee only if the adversary does not perturb the graph too much. Suppose that there are $h\leq k$ ground-truth clusters $C_{1},\cdots,C_{h}$ and the adversary perturbs an $\varepsilon$ -fraction of the intra-cluster edges. In order to recover, for each $C_{i}$ , a subset $P_{i}$ that is close to $D_{i}\subseteq C_{i}$ , we need to require that $\min_{i\in[h]}|D_{i}|\geq 3\sqrt{\varepsilon^{\prime}}n$ , which can be satisfied if $\varepsilon=O(\phi\cdot(\frac{\min_{i\in[h]}|D_{i}|}{n})^{2})$ .

We further remark that our algorithm can only (partially) recover the large clusters, say of size at least $\Omega(\varepsilon n)$ . This is the case because any small cluster (of size $o(\varepsilon n)$ ) can be completely hidden or destroyed by the adversary. Currently, our analysis shows that our algorithm can recover clusters of size $\Omega(\sqrt{\frac{\varepsilon}{\phi}}n)$ . It is an interesting question to design a robust clustering oracle that can recover smaller clusters (i.e., of size in the range $[\Omega(\varepsilon n),o(\sqrt{\frac{\varepsilon}{\phi}}n)]$ ).

4 The Local Reconstruction Algorithm

In this section, we present our reconstruction algorithm, which is built upon our robust clustering oracle algorithm in Section 3 and consists of two phases: the learning phase, which learns the cores (corresponding to clusters in the clusterable part) of the graph, and the query phase, which first checks whether the queried vertex belongs to any of the learned cores, and then outputs its neighbors in the amended clusterable graph accordingly. We need the following tool of explicit construction of expanders.

Explicit expanders.

For any vertex set $V=[n]$ , we let $G_{\exp}=(V,E_{\exp})$ denote a graph on $V$ with maximum degree at most $16$ such that for any set $S$ in $G_{\exp}$ with $|S|\leq n/2$ , it holds that $|E_{\exp}(S,V\setminus S)|\geq\eta|S|$ , for some constant $\eta>0$ . It is known (see e.g., Lemma 6 in [KPS13] which builds upon [GG81]) that such an expander $G_{\exp}$ also exists and can be explicitly constructed in the sense that for any specified vertex $v$ , one can find all neighbors of $v$ in $G_{\exp}$ in $\textrm{poly}(\log n)$ time.

In the following, given a graph $G$ , we let $G_{\exp}$ denote an explicit expander graph on the same vertex set as $G$ . We call vertices $G$ -neighbors or $G_{\exp}$ -neighbors of a vertex $v$ , depending on the graph under consideration.
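To illustrate what "explicit" means here, the following sketch computes neighborhoods in a Gabber-Galil type graph on $\mathbb{Z}_{m}\times\mathbb{Z}_{m}$ using only arithmetic on the queried vertex. This is an assumption made purely for illustration: the graph has degree at most $8$ rather than $16$ , it lives on $m^{2}$ vertices rather than on $[n]$ , and it is not claimed to be the construction of Lemma 6 in [KPS13]. The function name is hypothetical.

```python
def expander_neighbors(v, m):
    """Neighbors of v = (x, y) in a Gabber-Galil type graph on Z_m x Z_m."""
    x, y = v
    forward = [
        ((x + 2 * y) % m, y),
        ((x + 2 * y + 1) % m, y),
        (x, (y + 2 * x) % m),
        (x, (y + 2 * x + 1) % m),
    ]
    backward = [  # inverses of the four maps above, so that every edge is undirected
        ((x - 2 * y) % m, y),
        ((x - 2 * y - 1) % m, y),
        (x, (y - 2 * x) % m),
        (x, (y - 2 * x - 1) % m),
    ]
    return sorted(set(forward + backward) - {v})  # drop self-loops and duplicates
```

Each call performs a constant number of arithmetic operations on $O(\log n)$ -bit numbers and never materializes the graph, which is the property the reconstruction algorithm needs from $G_{\exp}$ .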

[TABLE]

Note that the algorithm should be implemented by first taking as input a random seed $s$ , which is fixed once for all (and used for sampling vertices in the learning phase and performing random walks), and then on any query vertex $v$ , deterministically outputting the neighborhood of $v$ in the graph $G^{\prime}$ . By construction, if an edge $(u,v)$ is added, then on query vertex $u$ , $v$ will be output as a neighbor of $u$ and vice versa. Therefore, the algorithm is independent of the order of queries and the answer will be globally consistent.
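The consistency argument can be sketched in a few lines. The sketch assumes that an expander edge $(u,v)$ is added exactly when at least one of its endpoints is reported as an outlier, and it treats the outlier test and both neighbor oracles as given functions that are deterministic once the seed $s$ is fixed. The names `reconstructed_neighbors`, `g_neighbors`, `gexp_neighbors` and `is_outlier` are hypothetical and do not refer to routines of the algorithm above.

```python
def reconstructed_neighbors(v, g_neighbors, gexp_neighbors, is_outlier):
    """Neighborhood of v in G': all G-edges plus expander edges touching an outlier."""
    result = set(g_neighbors(v))
    v_is_outlier = is_outlier(v)
    for w in gexp_neighbors(v):
        # The edge (v, w) is added if either endpoint is an outlier, so the
        # same decision is reached whether v or w is the queried vertex.
        if v_is_outlier or is_outlier(w):
            result.add(w)
    return sorted(result)


# Toy instance (assumption): G is the path 0-1-2-3, G_exp is the 4-cycle
# 0-2-1-3-0, and vertex 2 is the only vertex reported as an outlier.
G = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
G_EXP = {0: [2, 3], 1: [2, 3], 2: [0, 1], 3: [0, 1]}
OUTLIERS = {2}


def query(v):
    return reconstructed_neighbors(
        v, G.__getitem__, G_EXP.__getitem__, OUTLIERS.__contains__
    )
```

Because the decision for an edge depends only on the two endpoints and on the fixed seed, the answers to different queries describe one and the same graph $G^{\prime}$ regardless of the query order.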

4.1 Analysis of the Local Reconstruction Algorithm

In the following, we show the performance guarantee of the above algorithm and prove Theorem 1.4. We first note that the running time and query complexity can be analyzed in the same way as in the proof of Theorem 1.3.

It follows from the definition of $G_{\exp}$ that the maximum degree of $G^{\prime}$ is bounded by $d+16$ , as $G_{\exp}$ has maximum degree at most $16$ and for each vertex $u$ that is found to be an outlier, we will add all of its $G_{\exp}$ -neighbors to $u$ .

Recall from the description of our algorithm that $\kappa>0$ is a sufficiently small universal constant. If $\varepsilon>\frac{\phi\kappa^{2}}{100}$ (i.e., the noise is too much), then by our algorithm, the learning phase will output fail. Furthermore, on any query vertex $u$ , the query phase will output all $G$ -neighbors and all $G_{\exp}$ -neighbors of $u$ . Thus, $G^{\prime}$ is a complete hybridization of $G$ and $G_{\exp}$ . Note that for any set $S\subset V$ , $|E^{\prime}(S,\bar{S})|\geq|E_{\exp}(S,\bar{S})|$ , where $E^{\prime}$ and $E_{\exp}$ denote the set of edges in $G^{\prime}$ and $G_{\exp}$ respectively. Thus, it holds that if $|S|\leq\frac{n}{2}$ , $\phi_{G^{\prime}}(S)=\frac{|E^{\prime}(S,\bar{S})|}{d|S|}\geq\frac{|E_{\exp}(S,\bar{S})|}{d|S|}\geq\frac{\eta}{d}$ , where we used the fact that for any set $S$ with $|S|\leq\frac{n}{2}$ in $G_{\exp}$ , $|E_{\exp}(S,\bar{S})|\geq\eta|S|$ . Therefore, the resulting graph $G^{\prime}$ is $(1,\frac{\eta}{d},0)$ -clusterable. Furthermore, the number of edges added to $G$ is at most $16n/2=8n=O(\min\{1,k\sqrt{\varepsilon/\phi}\}\cdot n)$ as $\varepsilon>\frac{\phi\kappa^{2}}{100}$ . Thus, in this case, the statement of our theorem holds.

In the following, we prove the remaining properties listed in Theorem 1.4 for the more interesting case that $\varepsilon\in[\Omega(\frac{\phi}{{n}}),\frac{\phi\kappa^{2}}{100}]$ .

In this case, by the description of the local reconstruction algorithm, the number of added edges is at most $16$ times the number of vertices that are reported as outliers, and thus by Lemma 3.3, is at most $16\times 40k\sqrt{\frac{\varepsilon}{\phi}}n=640k\sqrt{\frac{\varepsilon}{\phi}}n$ . Now we analyze the cluster structure of the resulting graph.

Definition and property of weak vertices.

Let $\varepsilon^{\prime}:=\frac{6\varepsilon}{\phi}<\frac{\kappa^{2}}{100}$ . We introduce the following definitions of weak vertex for the analysis, which was inspired by the corresponding definitions for noisy expander graphs in [KPS13]. The main difference here is that we carefully take the size of clusters into consideration.

Definition 4.1.

We call a vertex $v$ a weak vertex if for any subset $A$ with $|A|\geq\frac{2\varepsilon^{\prime}}{3}n$ , it holds that $\lVert\textbf{b}_{v}^{t}-\mathbf{\mathcal{U}}_{A}\rVert_{\textrm{TV}}\geq 1/4.$

In order to analyze the cluster structure of the resulting graph $G^{\prime}$ , we need the following property of weak vertices.

Lemma 4.2.

With probability at least $1-n^{-3}$ , it holds that for any weak vertex $u$ , the algorithm will report $u$ as an outlier.

Proof.

We first show that if $u$ is weak, then for any subset $A$ with $|A|\geq\frac{2\varepsilon^{\prime}}{3}n$ vertices, at most $\frac{7}{8}|A|$ vertices $v$ in $A$ satisfy ${\textbf{b}}_{u}^{t}(v)\geq\frac{7/8}{|A|}$ . Suppose otherwise, i.e., more than $\frac{7}{8}|A|$ vertices $v$ in $A$ satisfy ${\textbf{b}}_{u}^{t}(v)\geq\frac{7/8}{|A|}$ . If we let $A_{1}\subseteq A$ (resp. $A_{2}\subseteq A$ ) denote the set of vertices $v$ in $A$ such that ${\textbf{b}}_{u}^{t}(v)\leq\frac{1}{|A|}$ (resp. ${\textbf{b}}_{u}^{t}(v)>\frac{1}{|A|}$ ), then

[TABLE]

which is a contradiction. By the definitions of reduced collision probability $\mathrm{rcp}_{\theta_{0}}(u,v)$ and relations of ${\textbf{a}}_{u}^{t}$ and ${\textbf{b}}_{u}^{t}$ , we have that $\mathrm{rcp}_{0}(u,v)\leq{\textbf{b}}_{u}^{t}(v)$ , and thus there can be at most $\frac{7}{8}|A|$ vertices $v$ in $A$ with $\mathrm{rcp}_{0}(u,v)\geq\frac{7}{8|A|}$ . Note that this property holds for all sets $A$ with $|A|\geq\frac{2\varepsilon^{\prime}}{3}n$ .

For each $0\leq j\leq J$ , we let $T_{j}$ denote the set of vertices $v$ such that $\mathrm{rcp}_{0}(u,v)\geq\frac{7}{8\tau_{j}n}$ . Recall that $\tau_{j}=3\sqrt{\varepsilon^{\prime}}(1+\frac{\kappa}{2})^{j}$ , for $0\leq j\leq J$ .

If $|T_{j}|\geq\tau_{j}n>\frac{2\varepsilon^{\prime}}{3}n$ , then for all vertices $v\in T_{j}$ , it holds that $\mathrm{rcp}_{0}(u,v)\geq\frac{7}{8\tau_{j}n}\geq\frac{7}{8|T_{j}|}$ , which is a contradiction. If $\frac{7\tau_{j}}{8}n<|T_{j}|<\tau_{j}n$ , then we can add arbitrarily at most $\frac{1}{8}\tau_{j}n$ vertices to $T_{j}$ to obtain a set $A$ such that $|A|=\tau_{j}n>\frac{2\varepsilon^{\prime}}{3}n$ , and for at least a $\frac{|T_{j}|}{|A|}>\frac{7}{8}$ fraction of vertices $v$ in $A$ , it holds that $\mathrm{rcp}_{0}(u,v)\geq\frac{7}{8\tau_{j}n}=\frac{7}{8|A|}$ , which is a contradiction. Therefore, it must hold that $|T_{j}|\leq\frac{7\tau_{j}n}{8}$ .

That is, for the weak vertex $u$ , it holds that for each $0\leq j\leq J$ , there will be at most $\frac{7}{8}\tau_{j}n$ vertices $v$ with $\mathrm{rcp}_{0}(u,v)\geq\frac{7}{8\tau_{j}n}$ . Thus, there will be at least $(1-\frac{7}{8}\tau_{j})n$ vertices $v$ with $\mathrm{rcp}_{0}(u,v)\leq\frac{7}{8\tau_{j}n}$ . We can further guarantee that with probability at least $1-1/n^{2}$ , for any such pair $u,v$ , the procedure EstimateRCP (with parameter $\delta\leq\frac{\sqrt{\kappa}}{10}$ ) either aborts or outputs an estimate $\mathrm{rcp}^{\prime}(u,v)\leq(1+\frac{\sqrt{\kappa}}{10})\frac{7}{8\tau_{j}n}\leq\frac{8}{9\tau_{j}n}$ , for any $0\leq j\leq J$ . Finally, with probability at least $1-\frac{2}{n^{2}}$ , in our sample set $S$ , at least $(1-\frac{7}{8}\tau_{j})$ fraction of vertices $v$ satisfy that $\mathrm{rcp}^{\prime}(u,v)\leq\frac{8}{9\tau_{j}n}$ , or equivalently, less than $\frac{7}{8}\tau_{j}$ fraction of vertices $v$ satisfy that $\mathrm{rcp}^{\prime}(u,v)\geq\frac{8}{9\tau_{j}n}$ . This implies that our algorithm will report $u$ as an outlier.

Cluster structure of $G^{\prime}$ .

Now we are ready to show that the resulting graph $G^{\prime}$ from our local reconstruction algorithm can be partitioned into at most $k$ parts, each of which has relatively large inner conductance.

Lemma 4.3.

Let $\phi^{*}=\frac{a_{\ref{lemma:conductance}}\varepsilon\phi}{k^{4}\log n}$ for some sufficiently small constant $a_{\ref{lemma:conductance}}$ . If $G$ is an $\varepsilon$ -perturbation of a $(k,\phi,\frac{a_{\ref{thm:rw_perturbed}}\varepsilon\kappa^{4}\phi}{3k^{3}\log n})$ -clusterable graph, then the resulting graph $G^{\prime}$ from the local reconstruction algorithm is $(k,\phi^{*},1)$ -clusterable.

Proof.

For analysis, we perform the following procedure on the input graph $G$ . Let $\gamma=\frac{\varepsilon^{\prime}}{3}=\frac{2\varepsilon}{\phi}$ . We start with the set $U:=V$ and a partitioning $\mathcal{P}:=\{V\}$ of $G$ . Then if there exists a set $U\in\mathcal{P}$ and $S\subseteq U$ such that $\gamma n\leq|S|\leq\frac{|U|}{2}$ and $\phi_{G}(S)\leq\phi_{\textrm{out}}:=\frac{a_{\ref{thm:rw_perturbed}}\varepsilon\kappa^{4}\phi}{3k^{3}\log n}$ , then we set $\mathcal{P}=(\mathcal{P}\setminus\{U\})\cup\{S,U\setminus S\}$ . We repeat until no such $S$ can be found. Let $\mathcal{P}=\{C_{1},\cdots,C_{h}\}$ denote the final partitioning of $V$ .
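For intuition only, the procedure can be run by brute force on a tiny graph. The sketch below is not meant to be efficient, since the procedure is used only in the analysis. It uses the conductance $\phi_{G}(S)=|E(S,V\setminus S)|/(d|S|)$ , and the function names, the example graph and the parameter values are illustrative assumptions.

```python
from itertools import combinations


def conductance(adj, d, S):
    """phi_G(S) = |E(S, V minus S)| / (d * |S|) for a d-bounded graph."""
    S = set(S)
    cut = sum(1 for u in S for w in adj[u] if w not in S)
    return cut / (d * len(S))


def find_split(adj, d, U, min_size, phi_out):
    """Return some S in U with min_size <= |S| <= |U|/2 and phi_G(S) <= phi_out, or None."""
    for size in range(max(1, min_size), len(U) // 2 + 1):
        for S in combinations(sorted(U), size):
            if conductance(adj, d, S) <= phi_out:
                return set(S)
    return None


def refine(adj, d, min_size, phi_out):
    """Keep splitting parts until no admissible low-conductance set remains."""
    parts = [set(range(len(adj)))]
    changed = True
    while changed:
        changed = False
        for U in parts:
            S = find_split(adj, d, U, min_size, phi_out)
            if S is not None:
                parts.remove(U)
                parts.extend([S, U - S])
                changed = True
                break  # restart the scan, since the list of parts was modified
    return sorted(sorted(p) for p in parts)


# Two 4-cliques joined by the single edge (3, 4); the maximum degree is d = 4.
adj = [[w for w in range(4 * (u // 4), 4 * (u // 4) + 4) if w != u] for u in range(8)]
adj[3].append(4)
adj[4].append(3)
parts = refine(adj, d=4, min_size=2, phi_out=0.1)   # min_size plays the role of gamma * n
```

On this input the only admissible sparse cut separates the two cliques, after which neither part can be split further, so the procedure stops with two parts.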

Note that for any $C_{i}$ , if $U$ is the subset that contains $C_{i}$ and is then split into $C_{i}$ and $U\setminus C_{i}$ , then $|U|\geq 2\gamma n$ and thus $|C_{i}|\geq\gamma n$ and $|U\setminus C_{i}|\geq\frac{|U|}{2}\geq\gamma n$ by the construction. This implies that at the end of the above procedure, it holds that $\min_{i}|C_{i}|\geq\gamma n$ .

We further note that $|\mathcal{P}|=h\leq k$ . This is true since otherwise, in order to make $G$ become a $(k,\phi,\phi_{\textrm{out}})$ -clusterable graph, one has to patch up at least one set $C_{i}$ to other parts, that is, we need to add at least $\frac{3\phi}{4}\cdot d\min_{i}\{|C_{i}|\}\geq\frac{3\phi}{4}\cdot d\cdot\frac{2\varepsilon}{\phi}n>\varepsilon dn$ edges, which is a contradiction to the assumption that $G$ is an $\varepsilon$ -perturbation of a $(k,\phi,\phi_{\textrm{out}})$ -clusterable graph.

Now let us consider the partition $\mathcal{P}$ in the constructed graph $G^{\prime}$ . Observe that by the description of our algorithm, for any set $S$ of vertices $|E^{\prime}(S,\bar{S})|\geq|E(S,\bar{S})|$ , where $E^{\prime}$ and $E$ denote the set of edges in $G^{\prime}$ and $G$ respectively. In particular, Lemma 4.2 implies that the set of $G^{\prime}$ -neighbors of any weak vertex $u$ is a superset of the set of $G$ -neighbors of $u$ , as $u$ will be reported as an outlier by the algorithm and the $G_{\exp}$ -neighbors of $u$ will be added to $G^{\prime}$ .

We have the following claim.

Claim 4.4.

In the graph $G^{\prime}$ , for each $C_{i}$ , and any subset $S\subset C_{i}$ with $|S|\leq\frac{|C_{i}|}{2}$ , it holds that $\phi_{G^{\prime}}(S)\geq k\phi^{*}$ .

Proof.

If $\gamma n\leq|S|\leq\frac{|C_{i}|}{2}$ , then by our construction of $C_{i}$ , we have that $\phi_{G}(S)\geq\phi_{\textrm{out}}$ . Thus, $\phi_{G^{\prime}}(S)=\frac{|E^{\prime}(S,V\setminus S)|}{d|S|}\geq\frac{|E(S,V\setminus S)|}{d|S|}\geq\phi_{\textrm{out}}\geq k\phi^{*}$ . Now let us consider the case that $|S|\leq\gamma n$ .

If fewer than a $(1-\frac{\eta}{2})$ fraction of the vertices in $S$ are weak, then we show that $\phi_{G}(S)\geq k\phi^{*}$ . Suppose this is not the case, that is, $\phi_{G}(S)<\frac{a_{\ref{lemma:conductance}}\varepsilon\phi}{k^{3}\log n}\leq\frac{\eta}{16t}$ , if we set $a_{\ref{lemma:conductance}}$ to be a sufficiently small constant. By the proof of Theorem 4 in [KPS13] (which in turn is based on the proof of Lemma 4.7 in [CS10]), we know that for at least a $(1-\eta/2)$ fraction of vertices $u$ in $S$ , the probability that a ${\textbf{b}}_{u}^{t}$ -random walk that starts at $u$ will end up in $\bar{S}$ is at most $1/4$ . Now let $A$ be any set with $|A|\geq\frac{2\varepsilon^{\prime}}{3}n$ . Since $|S|\leq\frac{\varepsilon^{\prime}}{3}n$ , it holds that $|A\setminus S|\geq\frac{1}{2}|A|$ . Thus, we have that $\mathbf{\mathcal{U}}_{A}(A\setminus S)\geq\frac{1}{2}$ . This gives that $\lVert{\textbf{b}}_{u}^{t}-\mathbf{\mathcal{U}}_{A}\rVert_{\textrm{TV}}\geq\frac{1}{4}$ , which implies that such a vertex $u$ is weak. Thus, $S$ contains at least a $(1-\eta/2)$ fraction of weak vertices, which is a contradiction. This implies that $\phi_{G^{\prime}}(S)\geq\phi_{G}(S)\geq k\phi^{*}$ .

If more than a $(1-\eta/2)$ fraction of the vertices in $S$ are weak, and $W$ denotes the set of weak vertices in $S$ , then the number of $G_{\exp}$ -neighbors of $W$ in $G_{\exp}$ is at least $\eta|W|$ . Since all these $G_{\exp}$ -neighbors are also neighbors in $G^{\prime}$ , we have that the number of these neighbors that lie outside of $S$ is at least $\eta|W|-|S\setminus W|\geq\eta(1-\eta/2)|S|-\eta/2|S|\geq\frac{\eta}{6}|S|$ . Since we add all the edges in $G_{\exp}$ that are incident to $W$ to $G^{\prime}$ , we have that the number of edges crossing $S$ in $G^{\prime}$ is at least $\frac{\eta}{6}|S|$ , and thus $\phi_{G^{\prime}}(S)\geq\frac{\eta}{6d}\geq k\phi^{*}$ .
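The arithmetic behind the bound $\frac{\eta}{6}|S|$ can be made explicit. It uses $|W|\geq(1-\frac{\eta}{2})|S|$ and $|S\setminus W|\leq\frac{\eta}{2}|S|$ , together with the assumption $\eta\leq\frac{2}{3}$ . This assumption is not stated above, but it is harmless because the expansion guarantee $|E_{\exp}(S,V\setminus S)|\geq\eta|S|$ remains valid when $\eta$ is decreased:

```latex
\eta|W| - |S\setminus W|
  \ge \eta\Bigl(1-\frac{\eta}{2}\Bigr)|S| - \frac{\eta}{2}|S|
  = \frac{\eta}{2}(1-\eta)\,|S|
  \ge \frac{\eta}{2}\cdot\frac{1}{3}\,|S|
  = \frac{\eta}{6}\,|S|.
```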

Now based on the partition $\mathcal{P}=\{C_{1},\cdots,C_{h}\}$ as constructed above, we find a new partition of $G^{\prime}$ such that each part has large inner conductance. We start with the partition $\mathcal{P}$ and perform the following operation. If there exist $i,j\leq h$ with $i\neq j$ and a set $S\subseteq C_{i}$ such that $|S|\leq\frac{|C_{i}|}{2}$ and $|E^{\prime}(S,C_{i}\setminus S)|<|E^{\prime}(S,C_{j})|$ , then we set $C_{i}:=C_{i}\setminus S$ and $C_{j}:=C_{j}\cup S$ . We repeat until no such $i,j,S$ exist.

Note that the above process always terminates in a finite number of steps since the number of crossing edges, i.e., $\sum_{i\neq j}|E^{\prime}(C_{i},C_{j})|$ , always decreases in each iteration. Furthermore, we observe that at the end of the process, for any $1\leq i\leq h$ , and any set $S\subseteq C_{i}$ with $|S|\leq\frac{|C_{i}|}{2}$ , $|E^{\prime}(S,C_{i}\setminus S)|\geq\frac{|E^{\prime}(S,V\setminus S)|}{k}$ . Therefore, $\phi_{G^{\prime}[C_{i}]}(S)\geq\frac{1}{k}\phi_{G^{\prime}}(S)\geq\phi^{*}$ . This implies that for each $i$ , $\phi(G^{\prime}[C_{i}])\geq\phi^{*}$ .

5 Local Mixing Property of Random Walks on Noisy Clusterable Graphs: Proof of Theorem 1.5

In this section, we give the proof of Theorem 1.5. To do so, we first give a property of random walks on clusterable graphs (without noise).

5.1 Local Mixing Property of Random Walks on Clusterable Graphs

We will first prove a mixing property of random walks on a clusterable graph, which says that in a clusterable graph, a random walk of appropriate length starting from a typical vertex of a large cluster will mix well inside the corresponding cluster. By a simple reduction (see Section 2), it suffices to consider a corresponding weighted $d$ -regular graph for any $d$ -bounded graph.

Theorem 5.1.

Let $0<\alpha,\beta,\xi\leq 1$ . Let $\phi_{\textrm{out}}\leq a_{\ref{thm:rw_clusterable}}\frac{\xi{\alpha\beta}\phi_{\textrm{in}}^{2}}{k^{3}\log n}$ for some sufficiently small constant $a_{\ref{thm:rw_clusterable}}>0$ . Let $G$ be a weighted $d$ -regular and $(k,\phi_{\textrm{in}},\phi_{\textrm{out}})$ -clusterable graph with underlying clusters $C_{1},\cdots,C_{h}$ for some $h\leq k$ . Then for each $C_{i}$ with $|C_{i}|\geq\alpha n$ , there exists a subset $C^{\prime}_{i}\subseteq C_{i}$ such that $|C_{i}^{\prime}|\geq(1-\beta)|C_{i}|$ , and for any $v\in C^{\prime}_{i}$ , and $t=\frac{20\log n}{\phi_{\textrm{in}}^{2}}$ , it holds that

$$\lVert\textbf{p}_{v}^{t}-\mathbf{\mathcal{U}}_{C_{i}}\rVert_{\textrm{TV}}\leq\xi.$$

We remark that [ST13] and [AOPT16] gave analysis for upper bounding the probability that a random walk of length $t$ from a typical vertex $v$ in a set $S$ with small conductance will escape the set $S$ , and lower bounding the probability that the walk from $v$ of length $t$ stays inside $S$ , respectively. It is unclear if one can use their analysis to prove the above theorem. In the following, we prove Theorem 5.1 by using some strong spectral property of clusterable graphs, i.e., the spectral gap between $\lambda_{h+1}$ and $\lambda_{h}$ for some $h\leq k$ , and the closeness of the space spanned by the first $h$ eigenvectors and the space spanned by the indicator vectors of clusters. More precisely, we need the following tools.

Lemma 5.2 (Lemma 5.2 in [CPS15] and Lemma 10 in [CKK+18]).

Let $G$ be a weighted $d$ -regular and $(k,\phi_{\textrm{in}},\phi_{\textrm{out}})$ -clusterable graph with underlying clusters $C_{1},\cdots,C_{h}$ for some $h\leq k$ . Then $\lambda_{h}\leq 2\phi_{\textrm{out}}$ and $\lambda_{h+1}\geq\frac{\phi_{\textrm{in}}^{2}}{2}$ .

Fact 5.3.

It holds that $\lVert\textbf{1}_{v}\rVert_{2}^{2}=\sum_{j=1}^{n}\mathbf{v}_{j}(v)^{2}=1$ , for any $v\in V$ .

The following is a direct corollary of a structural result due to [PSZ17] that relates the first $k$ eigenvectors of the Laplacian to the normalized indicator vectors of some $k$ -partition of the graph. Recall that $\mathbf{v}_{i}$ is the eigenvector corresponding to the $i$ -th smallest eigenvalue of the Laplacian of $G$ .

Theorem 5.4.

Let $\phi_{\textrm{out}}\leq a_{\ref{thm:PSZ_structure}}\phi_{\textrm{in}}^{2}/k^{2}$ for sufficiently small constant $a_{\ref{thm:PSZ_structure}}>0$ . Let $G$ be a weighted $d$ -regular and $(k,\phi_{\textrm{in}},\phi_{\textrm{out}})$ -clusterable graph with underlying $(\phi_{\textrm{in}},\phi_{\textrm{out}})$ -clusters $C_{1},\cdots,C_{h}$ for some $h\leq k$ . Let $\mathbf{r}_{i}:=\frac{1}{\sqrt{|C_{i}|}}\cdot\textbf{1}_{C_{i}}$ . Then there exist $h$ orthonormal vectors $\tilde{\mathbf{r}}_{1},\cdots,\tilde{\mathbf{r}}_{h}\in\textrm{span}(\mathbf{r}_{1},\cdots,\mathbf{r}_{h})$ and a constant $c_{\ref{thm:PSZ_structure}}>0$ , such that

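$$\lVert\mathbf{v}_{i}-\tilde{\mathbf{r}}_{i}\rVert_{2}^{2}\leq c_{\ref{thm:PSZ_structure}}\cdot\frac{h\phi_{\textrm{out}}}{\phi_{\textrm{in}}^{2}}\quad\text{for every }1\leq i\leq h.$$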

Proof.

Let $\rho(h):=\min_{A_{1},\cdots,A_{h}}\max\{\phi_{G}(A_{i}):i=1,\cdots,h\}$ , where the minimum is taken over all $h$ -partitions $A_{1},\cdots,A_{h}$ . It is proven in Theorem 1.1 of [PSZ17] that if $\lambda_{h+1}/\rho(h)\geq ch^{2}$ for some constant $c>0$ , then there exist orthonormal vectors $\tilde{\mathbf{r}}_{1},\cdots,\tilde{\mathbf{r}}_{h}\in\textrm{span}(\mathbf{r}_{1},\cdots,\mathbf{r}_{h})$ such that $\lVert\mathbf{v}_{i}-\tilde{\mathbf{r}}_{i}\rVert_{2}^{2}\leq 1.1h\cdot\frac{\rho(h)}{\lambda_{h+1}}.$

Note that by definition, $\rho(h)\leq\phi_{\textrm{out}}$ . In addition, by Lemma 5.2, it holds that $\lambda_{h+1}\geq\frac{\phi_{\textrm{in}}^{2}}{2}$ . Furthermore, since $\phi_{\textrm{out}}\leq a_{\ref{thm:PSZ_structure}}\phi_{\textrm{in}}^{2}/k^{2}\leq a_{\ref{thm:PSZ_structure}}\phi_{\textrm{in}}^{2}/{h^{2}}$ , it holds that $\lambda_{h+1}/\rho(h)\geq\frac{\phi_{\textrm{in}}^{2}}{2\phi_{\textrm{out}}}\geq\frac{h^{2}}{2a_{\ref{thm:PSZ_structure}}}\geq ch^{2}$ , as $a_{\ref{thm:PSZ_structure}}$ is a sufficiently small constant. This then implies that $\lVert\mathbf{v}_{i}-\tilde{\mathbf{r}}_{i}\rVert_{2}^{2}\leq 1.1h\cdot\frac{\rho(h)}{\lambda_{h+1}}\leq c_{\ref{thm:PSZ_structure}}\cdot\frac{h\phi_{\textrm{out}}}{\phi_{\textrm{in}}^{2}}$ for some constant $c_{\ref{thm:PSZ_structure}}$ .

Now we are ready to prove Theorem 5.1. We first give the high-level idea. We will bound the $\ell_{2}$ -norm distance between the random walk distribution $\textbf{p}_{v}^{t}$ and the uniform distribution $\mathbf{\mathcal{U}}_{C}$ over the cluster $C$ that contains $v$ , i.e., $\lVert\textbf{p}_{v}^{t}-\mathbf{\mathcal{U}}_{C}\rVert_{2}$ . To do so, we note that by Theorem 5.4, the vector $\mathbf{\mathcal{U}}_{C}$ , which is a scaling of the indicator vector of $C$ , lies in a space that is well approximated by the space spanned by the first $h$ eigenvectors of the matrix $\mathbf{P}$ (where $h\leq k$ is the number of clusters). Using this, we show that the projection of $\textbf{p}_{v}^{t}-\mathbf{\mathcal{U}}_{C}$ on the space spanned by the first $h$ eigenvectors is small. Furthermore, by Lemma 5.2, $\lambda_{h+1}$ is large, and thus the length of the projection of $\textbf{p}_{v}^{t}-\mathbf{\mathcal{U}}_{C}$ on the space spanned by the remaining $n-h$ eigenvectors is dominated by $(1-\frac{\lambda_{h+1}}{2})^{O(t)}$ , which is also small for appropriately chosen $t$ . Now we give the details.

Proof of Theorem 5.1.

For any vertex $v$ , we let $X_{v}:=\sum_{j=1}^{h}\mathbf{v}_{j}(v)^{2}$ . We first note that $\sum_{v\in V}X_{v}=\sum_{v\in V}\sum_{j=1}^{h}\mathbf{v}_{j}(v)^{2}=\sum_{j=1}^{h}\lVert\mathbf{v}_{j}\rVert_{2}^{2}=h$ . Therefore, by the averaging argument, there can be at most $\frac{\beta\alpha}{2}n$ vertices $v$ with $X_{v}\geq\frac{2h}{\beta\alpha n}$ .

Note that by the precondition of the Theorem, it holds that $\phi_{\textrm{out}}\leq a_{\ref{thm:PSZ_structure}}\phi_{\textrm{in}}^{2}/k^{2}$ . Let $\mathbf{r}_{i}$ and $\tilde{\mathbf{r}}_{i}$ be the vectors as defined in Theorem 5.4. Let $Y_{v}:=\sum_{j=1}^{h}(\mathbf{v}_{j}(v)-\tilde{\mathbf{r}}_{j}(v))^{2}$ . Then by applying Theorem 5.4 with graph $G$ , we have that

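$$\sum_{v\in V}Y_{v}=\sum_{j=1}^{h}\lVert\mathbf{v}_{j}-\tilde{\mathbf{r}}_{j}\rVert_{2}^{2}\leq c_{\ref{thm:PSZ_structure}}\cdot\frac{h^{2}\phi_{\textrm{out}}}{\phi_{\textrm{in}}^{2}}.$$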

Again, by the averaging argument, there can be at most $\frac{\beta\alpha}{2}n$ vertices $v$ with $Y_{v}\geq c_{\ref{thm:PSZ_structure}}\cdot\frac{h^{2}\phi_{\textrm{out}}}{\phi_{\textrm{in}}^{2}}\frac{2}{\beta\alpha n}$ .

Now let us define $C^{\prime}_{i}:=\{v:v\in C_{i},X_{v}\leq\frac{2h}{\beta\alpha n},Y_{v}\leq c_{\ref{thm:PSZ_structure}}\cdot\frac{h^{2}\phi_{\textrm{out}}}{\phi_{\textrm{in}}^{2}}\frac{2}{\beta\alpha n}\}$ . Note that for any $C_{i}$ with $|C_{i}|\geq\alpha n$ , it holds that $|C^{\prime}_{i}|\geq|C_{i}|-(\frac{\beta\alpha}{2}+\frac{\beta\alpha}{2})n\geq|C_{i}|-\beta|C_{i}|\geq(1-\beta)|C_{i}|$ .

Let us consider any vertex $v\in C_{i}^{\prime}$ . Since $\mathbf{r}_{i}=\frac{\textbf{1}_{C_{i}}}{\sqrt{|C_{i}|}}$ , it holds that

[TABLE]

where the last equation follows from the fact that $\tilde{\mathbf{r}}_{1},\cdots,\tilde{\mathbf{r}}_{h}$ have the same linear span as vectors $\mathbf{r}_{1},\cdots,\mathbf{r}_{h}$ , which in turn follows from the properties of $\{\tilde{\mathbf{r}}_{i}\}$ as guaranteed by Theorem 5.4.

Recall that $\textbf{p}_{v}^{t}=\sum_{j=1}^{n}(1-\frac{\lambda_{j}}{2})^{t}\mathbf{v}_{j}(v)\cdot\mathbf{v}_{j}$ . We let $t=\frac{20\log n}{\phi_{\textrm{in}}^{2}}$ . Thus, we have that

[TABLE]

Now observe that

[TABLE]

Therefore,

[TABLE]

where the last inequality follows from our setting that $h\leq k,t=\frac{20\log n}{\phi_{\textrm{in}}^{2}}$ and $\phi_{\textrm{out}}\leq a_{\ref{thm:rw_clusterable}}\frac{\xi{\alpha\beta}\phi_{\textrm{in}}^{2}}{k^{3}\log n}$ , where $a_{\ref{thm:rw_clusterable}}>0$ is some sufficiently small constant.

Therefore, it holds that $\lVert\textbf{p}_{v}^{t}-\mathbf{\mathcal{U}}_{C_{i}}\rVert_{\textrm{TV}}=\frac{1}{2}\lVert\textbf{p}_{v}^{t}-\mathbf{\mathcal{U}}_{C_{i}}\rVert_{1}\leq\frac{1}{2}\sqrt{n}\cdot\lVert\textbf{p}_{v}^{t}-\mathbf{\mathcal{U}}_{C_{i}}\rVert_{2}\leq\xi$ .
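As an informal illustration of the local mixing behaviour in Theorem 5.1, the following sketch simulates the lazy walk on a toy graph. It assumes the transition matrix $\mathbf{P}=\frac{\mathbf{I}+\frac{1}{d}\mathbf{A}}{2}$ with self-loops padding every vertex to degree $d$ (our reading of the reduction in Section 2). The instance (two 20-cliques joined by one edge), the walk length $t=20$ and all function names are hypothetical choices for illustration; the parameters do not satisfy the preconditions of the theorem, so this is only a numerical sanity check.

```python
import math

def lazy_walk_matrix(n, edges, d):
    """Return P = (I + A/d) / 2, padding each vertex with self-loops up to degree d."""
    P = [[0.0] * n for _ in range(n)]
    deg = [0] * n
    for u, v in edges:
        P[u][v] += 1 / (2 * d)
        P[v][u] += 1 / (2 * d)
        deg[u] += 1
        deg[v] += 1
    for u in range(n):
        # Laziness contributes 1/2; padding self-loops contribute (d - deg) / (2d).
        P[u][u] += 0.5 + (d - deg[u]) / (2 * d)
    return P

def walk_distribution(P, start, t):
    """Distribution of a t-step walk started at `start` (row vector times P^t)."""
    n = len(P)
    p = [0.0] * n
    p[start] = 1.0
    for _ in range(t):
        p = [sum(p[u] * P[u][v] for u in range(n)) for v in range(n)]
    return p

def tv_and_l2(p, q):
    """Total variation distance and l2 distance between two vectors."""
    diff = [a - b for a, b in zip(p, q)]
    return 0.5 * sum(abs(x) for x in diff), math.sqrt(sum(x * x for x in diff))

# Toy instance (hypothetical): two 20-cliques joined by a single edge, d = 20.
m, d = 20, 20
edges = [(a + off, b + off) for off in (0, m) for a in range(m) for b in range(a + 1, m)]
edges.append((m - 1, m))
P = lazy_walk_matrix(2 * m, edges, d)
p = walk_distribution(P, start=0, t=20)

uniform_cluster = [1 / m] * m + [0.0] * m
uniform_all = [1 / (2 * m)] * (2 * m)
tv_cluster, l2_cluster = tv_and_l2(p, uniform_cluster)
tv_all, _ = tv_and_l2(p, uniform_all)
```

In this toy run the walk is close to uniform on its own clique while still far from the stationary distribution on the whole graph, and the final step of the proof, $\lVert\cdot\rVert_{\textrm{TV}}\leq\frac{1}{2}\sqrt{n}\lVert\cdot\rVert_{2}$ , can be checked directly on `tv_cluster` and `l2_cluster`.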

5.2 From Clusterable Graphs to Noisy Clusterable Graphs

Now we analyze the random walk on a noisy clusterable graph $G$ , for which we use an induced Markov chain introduced in [KPS13] and some property of stopping rules of Markov chains [LW97].

A tool: stopping rules of Markov Chains.

Consider a finite, irreducible, discrete time Markov chain on the state space $V=[n]$ with stationary distribution $\pi$ . For any distribution $\sigma$ , we let $\sigma^{t}$ denote the distribution of a $t$ -step walk on the Markov chain with initial distribution $\sigma$ . A stopping rule $\Gamma$ of the Markov chain is a rule that observes the walk and decides whether to stop or not on the basis of what has been observed so far (see e.g., [LW97] for formal definition). Given a starting distribution $\sigma$ and a target distribution $\tau$ , we say that a stopping rule $\Gamma$ is a stopping rule from $\sigma$ to $\tau$ if the initial state is drawn from $\sigma$ and the final state is governed by $\tau$ . Let $\textrm{E}[\Gamma]$ denote the expected length before $\Gamma$ halts. For any two distributions $\sigma$ and $\tau$ , we let $\mathcal{H}(\sigma,\tau)$ denote the minimal expected length $\textrm{E}[\Gamma]$ among all stopping rules $\Gamma$ from $\sigma$ to $\tau$ .

Let $\sigma^{(t)}$ denote the distribution of a uniform average walk of length $t$ with initial distribution $\sigma$ . The following lemma was proved by Lovász and Winkler.

Lemma 5.5 ([LW97]).

For any distribution $\tau$ , and any subset $U\subset V$ ,

[TABLE]

where $\tau^{m}$ denotes the probability vector of an $m$ step random walk on the Markov chain with initial distribution $\tau$ .

We remark that the above inequality is not explicitly stated in [LW97], but it follows directly from the proof of Lemma 4.22 in [LW97].
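For concreteness, the distribution $\sigma^{(t)}$ of a uniform average walk can be computed as follows. This sketch assumes the convention that the walk length is chosen uniformly from $\{0,\ldots,t-1\}$ , so that $\sigma^{(t)}=\frac{1}{t}\sum_{m=0}^{t-1}\sigma\mathbf{P}^{m}$ ; this matches the range $0\leq m\leq t-1$ used later in Claim 5.9, but the exact definition is given elsewhere in the paper. The function name is hypothetical.

```python
def uniform_average_walk(P, sigma, t):
    """Average of sigma P^m over m = 0, ..., t-1 (assumed convention)."""
    n = len(P)
    current = list(sigma)
    total = [0.0] * n
    for _ in range(t):
        total = [a + b for a, b in zip(total, current)]
        # Advance the walk by one step: current <- current P.
        current = [sum(current[u] * P[u][v] for u in range(n)) for v in range(n)]
    return [x / t for x in total]

# Two-state chain that mixes in one step; the average of m = 0 and m = 1.
P = [[0.5, 0.5], [0.5, 0.5]]
avg = uniform_average_walk(P, [1.0, 0.0], t=2)
```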

An induced Markov chain.

Let $G=(V,E)$ be a $d$ -bounded graph. Let $\mathcal{M}$ be the Markov chain corresponding to the (normal) random walks on the input graph $G$ . For simplicity, we assume $\mathcal{M}$ is irreducible (i.e., the graph is connected). By definition, the stationary distribution $\pi$ of $\mathcal{M}$ is the uniform distribution $\mathbf{\mathcal{U}}_{V}$ on $V$ , that is $\pi(i)=\frac{1}{n}$ . Let $D$ denote a (large) subset of $V$ and let $B=V\setminus D$ . Now we describe the new Markov chain $\mathcal{M}^{\prime}$ , that has been considered in [KPS13], with state set $D$ as follows. For any two vertices $u,v\in D$ , the transition probability $\textbf{p}^{\prime}_{u}(v)$ in $\mathcal{M}^{\prime}$ is the sum of $\textbf{p}_{u}(v)$ , i.e., the transition probability from $u$ to $v$ in $\mathcal{M}$ , and the probability $\textbf{b}_{u}^{(t)}(v)$ that is equal to the total probability of all length $t$ walks from $u$ to $v$ all of whose states, except for the end points $u$ and $v$ are in $B$ , for any integer $t\geq 2$ . That is, $\textbf{p}^{\prime}_{u}(v)=\textbf{p}_{u}(v)+\sum_{t\geq 2}\textbf{b}_{u}^{(t)}(v)$ . The chain $\mathcal{M}^{\prime}$ is formally constructed by first retaining the original transition in $\mathcal{M}$ between $u,v$ and then adding new transitions $e_{u}^{(t)}(v)$ with transition probability $\textbf{b}_{u}^{(t)}(v)$ for any $t\geq 2$ , for any $u,v\in D$ .

We note that the chain $\mathcal{M}^{\prime}$ is the stochastic complement of $\mathcal{M}$ with respect to set $D$ [Mey89]. Let $\mathbf{P}=\bigl{(}\begin{smallmatrix}\mathbf{P}_{D}&\mathbf{P}_{1}\\ \mathbf{P}_{2}&\mathbf{P}_{B}\end{smallmatrix}\bigr{)}$ denote the transition probability matrix underlying $\mathcal{M}$ . We have the following lemma regarding the transition probability matrix $\mathbf{P}^{\prime}$ underlying $\mathcal{M}^{\prime}$ .

Lemma 5.6 ([Mey89]).

The Markov chain $\mathcal{M}^{\prime}$ is irreducible and aperiodic. Furthermore, its transition probability matrix is $\mathbf{P}^{\prime}=\mathbf{P}_{D}+\mathbf{P}_{1}(\mathbf{I}-\mathbf{P}_{B})^{-1}\mathbf{P}_{2}$ .

It is known (see e.g., [Mey89] and [KPS13]) that the stationary distribution of $\mathcal{M}^{\prime}$ is given by the vector $\pi^{\prime}\in\mathbb{R}^{D}$ such that $\pi^{\prime}(u)=\frac{\pi(u)}{\pi(D)}=\frac{1}{|D|}$ for any $u\in D$ .
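The formula of Lemma 5.6 and the uniform stationary distribution of $\mathcal{M}^{\prime}$ can be checked on a small example. The sketch below uses exact rational arithmetic and a plain Gauss-Jordan inverse; the helper names and the instance (the lazy walk on a 4-cycle with $D=\{0,1,2\}$ and $B=\{3\}$ ) are hypothetical choices for illustration.

```python
from fractions import Fraction

def mat_mul(X, Y):
    """Product of two matrices given as lists of rows."""
    return [[sum(X[i][k] * Y[k][j] for k in range(len(Y))) for j in range(len(Y[0]))]
            for i in range(len(X))]

def inverse(M):
    """Gauss-Jordan inverse of an invertible square matrix of Fractions."""
    n = len(M)
    aug = [list(row) + [Fraction(int(i == j)) for j in range(n)] for i, row in enumerate(M)]
    for col in range(n):
        pivot = next(r for r in range(col, n) if aug[r][col] != 0)
        aug[col], aug[pivot] = aug[pivot], aug[col]
        scale = 1 / aug[col][col]
        aug[col] = [x * scale for x in aug[col]]
        for r in range(n):
            if r != col and aug[r][col] != 0:
                factor = aug[r][col]
                aug[r] = [x - factor * y for x, y in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]

def stochastic_complement(P, D, B):
    """Return P' = P_D + P_1 (I - P_B)^{-1} P_2 for the index lists D and B."""
    P_D = [[P[u][v] for v in D] for u in D]
    P_1 = [[P[u][w] for w in B] for u in D]
    P_2 = [[P[w][v] for v in D] for w in B]
    I_minus_P_B = [[Fraction(int(a == b)) - P[a][b] for b in B] for a in B]
    correction = mat_mul(mat_mul(P_1, inverse(I_minus_P_B)), P_2)
    return [[P_D[i][j] + correction[i][j] for j in range(len(D))] for i in range(len(D))]

# Lazy random walk on a 4-cycle: stay with probability 1/2, move to each neighbor with 1/4.
n = 4
P = [[Fraction(0)] * n for _ in range(n)]
for u in range(n):
    P[u][u] = Fraction(1, 2)
    P[u][(u + 1) % n] = Fraction(1, 4)
    P[u][(u - 1) % n] = Fraction(1, 4)

P_prime = stochastic_complement(P, D=[0, 1, 2], B=[3])
```

Here every row and every column of `P_prime` sums to 1, so the uniform distribution on $D$ is stationary, as stated above; the new transition between vertices 0 and 2 has probability $\frac{1}{8}$ and accounts for walks that pass through vertex 3.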

Now let us consider a vertex $s\in D$ and an integer $\ell$ that will be specified later. Let $\tau:={\textbf{p}^{\prime}}_{s}^{(\ell)}$ denote the distribution of a random walk of length $\ell$ starting from $s\in D$ in $\mathcal{M}^{\prime}$ . Consider the stopping rule $\Gamma$ that stops the walk in $\mathcal{M}$ as soon as it has taken $\ell$ steps in $\mathcal{M}^{\prime}$ , that is, $\Gamma$ is a stopping rule from $\textbf{1}_{s}$ to $\tau$ . Recall that $\textrm{E}[\Gamma]$ denotes the expected number of steps the walk takes starting from $s$ before being terminated by the stopping rule $\Gamma$ . The following lemma has been proven in [KPS13].

Lemma 5.7 ([KPS13]).

There exists a set $\tilde{B}\subseteq D$ with $\pi(\tilde{B})\leq\pi(B)$ such that for any $s\in D\setminus\tilde{B}$ , $\textrm{E}[\Gamma]\leq 2\ell$ . In particular, for any such vertex $s$ , $\mathcal{H}(\textbf{1}_{s},\tau)\leq 2\ell$ .

Now we use the above induced chain to analyze the random walks on noisy clusterable graphs. Let $G$ be a graph with an $h$ -partition $C_{i}$ , $i\leq h$ satisfying the precondition of Theorem 1.5. We let $D$ denote the union of all $D_{i}$ ’s with $|D_{i}|\geq 2|B_{i}|$ , that is, $D=\cup_{i:|D_{i}|\geq 2|B_{i}|}D_{i}$ and $B=V\setminus D$ . We consider the induced Markov chain $\mathcal{M}^{\prime}$ with state set $D$ .

Recall that we let $\mathbf{A}$ denote the adjacency matrix of the $d$ -regular graph $G^{\prime}$ corresponding to $G$ (see Section 2). Then the transition probability matrix is $\mathbf{P}=\frac{\mathbf{I}+\frac{1}{d}\mathbf{A}}{2}$ . If we let $\mathbf{A}=\bigl{(}\begin{smallmatrix}\mathbf{A}_{D}&\mathbf{A}_{1}\\ \mathbf{A}_{2}&\mathbf{A}_{B}\end{smallmatrix}\bigr{)}$ , then by Lemma 5.6, the transition probability matrix of $\mathcal{M}^{\prime}$ is

[TABLE]

If we let $G_{\mathcal{M}^{\prime}}$ denote the (weighted) $d$ -bounded graph with adjacency matrix $\mathbf{A}_{D}+\mathbf{A}_{1}(2d\mathbf{I}-\mathbf{A}_{B})^{-1}\mathbf{A}_{2}$ , then by the above analysis (and the fact that $(2d\mathbf{I}-\mathbf{A}_{B})^{-1}\geq\mathbf{0}$ [Mey89]), $\mathcal{M}^{\prime}$ corresponds to the lazy random walk on the graph $G_{\mathcal{M}^{\prime}}$ .

In the following, we show that $G_{\mathcal{M}^{\prime}}$ is a clusterable graph with clusters $D_{i}\subseteq D$ , which will imply that the chain $\mathcal{M}^{\prime}$ has the nice local mixing property as guaranteed by Theorem 5.1. Then we can use the stopping rules to relate the chains $\mathcal{M}^{\prime}$ and $\mathcal{M}$ .

The following lemma shows that if we construct $\mathcal{M}^{\prime}$ as above for the graph that satisfies the precondition of Theorem 1.5, then $G_{\mathcal{M}^{\prime}}$ is $(k,\phi_{\textrm{in}},O(\phi_{\textrm{out}}))$ -clusterable. This is trivial for the case of $k=1$ (as in [KPS13]), as the inner conductance of any set is monotonically increasing. However, for general $k\geq 2$ , we need to deal with the difficulty of bounding the outer conductance of potential clusters, as the outer conductance of any set is also monotonically increasing due to our construction.

Lemma 5.8.

Let $G=(V,E)$ be a $d$ -bounded graph with an $h$ -partition $C_{i}$ , $i\leq h$ such that $\phi_{G}(C_{i})\leq\phi_{\textrm{out}}$ . Furthermore, each $C_{i}$ can be partitioned into two subsets $D_{i}$ and $B_{i}$ such that $\phi(G[D_{i}])\geq\phi_{\textrm{in}}$ . Let $D=\cup_{i:|D_{i}|\geq 2|B_{i}|}D_{i}$ and $B=V\setminus D$ . Let $G_{\mathcal{M}^{\prime}}$ be the weighted graph corresponding to the Markov chain $\mathcal{M}^{\prime}$ on $D$ constructed as above. Then in the graph $G_{\mathcal{M}^{\prime}}$ , each $D_{i}\subseteq D$ has inner conductance at least $\phi_{\textrm{in}}$ and outer conductance at most $3\phi_{\textrm{out}}$ .

Proof.

We first consider the inner conductance of $D_{i}$ in $G_{\mathcal{M}^{\prime}}$ . Let $S\subseteq D_{i}$ with $|S|\leq\frac{|D_{i}|}{2}$ . By the fact that the adjacency matrix of $G_{\mathcal{M}^{\prime}}$ is $\mathbf{A}_{D}+\mathbf{A}_{1}(2d\mathbf{I}-\mathbf{A}_{B})^{-1}\mathbf{A}_{2}$ , it holds that $|E_{G_{\mathcal{M}^{\prime}}}(S,D_{i}\setminus S)|\geq|E_{G}(S,D_{i}\setminus S)|\geq\phi_{\textrm{in}}d|S|$ . This implies that the inner conductance of $D_{i}$ in $G_{\mathcal{M}^{\prime}}$ is at least $\phi_{\textrm{in}}$ .

To bound the outer conductance of $D_{i}$ in $G_{\mathcal{M}^{\prime}}$ , we instead bound the outer conductance $\phi_{\mathcal{M}^{\prime}}(D_{i})$ of $D_{i}$ in the Markov chain $\mathcal{M}^{\prime}$ , which is defined to be $\phi_{\mathcal{M}^{\prime}}(D_{i}):=\frac{\sum_{u\in D_{i},v\in D\setminus D_{i}}\pi^{\prime}(u)\textbf{p}^{\prime}_{u}(v)}{\pi^{\prime}(D_{i})}$ , where $\textbf{p}_{u}^{\prime}(v)$ denotes the transition probability from $u$ to $v$ in the Markov chain $\mathcal{M}^{\prime}$ . Note that by our definitions, $\phi_{G_{\mathcal{M}^{\prime}}}(D_{i})=2\phi_{\mathcal{M}^{\prime}}(D_{i})$ .

Recall that $\pi^{\prime}(u)=\frac{1}{|D|}$ and that the transition probability matrix of $\mathcal{M}^{\prime}$ is $\mathbf{P}^{\prime}$ given by Equation (1). Then we have that

[TABLE]

where the last equation follows from the Neumann series $(\mathbf{I}-\frac{\mathbf{A}_{B}}{2d})^{-1}=\sum_{j=0}^{\infty}(\frac{\mathbf{A}_{B}}{2d})^{j}$ .
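The Neumann series identity can be checked numerically on a small block. The sketch below assumes a hypothetical set $B$ consisting of two adjacent vertices with $d=3$ , so that $\frac{\mathbf{A}_{B}}{2d}$ has spectral radius $\frac{1}{6}<1$ and the series converges; in general convergence follows since every row sum of $\frac{\mathbf{A}_{B}}{2d}$ is at most $\frac{1}{2}$ . The helper names are illustrative only.

```python
def mat_mul(X, Y):
    """Product of two matrices given as lists of rows."""
    return [[sum(X[i][k] * Y[k][j] for k in range(len(Y))) for j in range(len(Y[0]))]
            for i in range(len(X))]

def neumann_sum(M, terms):
    """Partial sum I + M + M^2 + ... + M^(terms - 1)."""
    n = len(M)
    total = [[float(i == j) for j in range(n)] for i in range(n)]
    power = [row[:] for row in total]
    for _ in range(terms - 1):
        power = mat_mul(power, M)
        total = [[total[i][j] + power[i][j] for j in range(n)] for i in range(n)]
    return total

# Hypothetical block: B is a single edge, d = 3, so A_B / (2d) has entries 1/6.
d = 3
A_B = [[0, 1], [1, 0]]
M = [[a / (2 * d) for a in row] for row in A_B]
S = neumann_sum(M, terms=60)

# (I - M) S should be (numerically) the identity matrix.
I_minus_M = [[float(i == j) - M[i][j] for j in range(2)] for i in range(2)]
check = mat_mul(I_minus_M, S)
```

For this block the exact inverse is $\frac{36}{35}\bigl(\begin{smallmatrix}1&1/6\\1/6&1\end{smallmatrix}\bigr)$ , which the partial sum `S` approximates to machine precision.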

We bound each term in the right hand side of the above inequality as follows. First, we have that

[TABLE]

Furthermore, we observe that $\textbf{1}_{u}\cdot\mathbf{A}_{1}\cdot\mathbf{A}_{2}\cdot\textbf{1}_{v}^{T}$ is exactly the number of paths that start from $u$ , then go to a vertex $w\in B$ , and then move to $v$ . Thus,

[TABLE]

Similarly, for each $j\geq 1$ , $\textbf{1}_{u}\cdot\mathbf{A}_{1}\cdot\mathbf{A}_{B}^{j}\cdot\mathbf{A}_{2}\cdot\textbf{1}_{v}^{T}$ is exactly the number of paths that start from $u$ , then go to a vertex $w_{1}\in B$ , and move inside $B$ for the next $j$ steps until some vertex $w_{2}\in B$ , and then move to $v$ . We have that

[TABLE]

where in the first inequality, the third summation is taken over all possible paths $p$ from $w_{1}$ to some vertex $w_{2}\in B$ , such that the length of $p$ is $j$ and all vertices on $p$ belong to $B$ ; in the second inequality, we used the fact that the number of such paths $p$ is at most $d^{j}$ and each vertex has degree at most $d$ .

Thus,

[TABLE]

By the above inequalities (2),(3),(4), we obtain that

[TABLE]

where in the second to last inequality, we used the assumption that $|D_{i}|\geq 2|B_{i}|$ , which gives that $|D_{i}|\geq\frac{2}{3}|C_{i}|$ .

Therefore, $\phi_{G_{\mathcal{M}^{\prime}}}(D_{i})=2\phi_{\mathcal{M}^{\prime}}(D_{i})\leq 3\phi_{\textrm{out}}$ .

Now we are ready to prove Theorem 1.5.

Proof of Theorem 1.5.

Let $D=\cup_{j:|D_{j}|\geq 2|B_{j}|}D_{j}$ . Let $B=V\setminus D$ . Then it holds that $|B|=\sum_{1\leq i\leq h}|B_{i}|+\sum_{i:|D_{i}|<2|B_{i}|}|D_{i}|\leq 3\sum_{1\leq i\leq h}|B_{i}|\leq 3\varepsilon n$ , and $|D|\geq(1-3\varepsilon)n$ . We consider the induced Markov chain $\mathcal{M}^{\prime}$ on $D$ as above. By Lemma 5.8, the corresponding $d$ -bounded weighted graph $G_{\mathcal{M}^{\prime}}$ is $(k,\phi_{\textrm{in}},3\phi_{\textrm{out}})$ -clusterable. In particular, $\phi_{G_{\mathcal{M}^{\prime}}}(D_{i})\leq 3\phi_{\textrm{out}}$ and $\phi(G_{\mathcal{M}^{\prime}}[D_{i}])\geq\phi_{\textrm{in}}$ for any $D_{i}\subset D$ .
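The choice of $D$ and $B$ and the bound $|B|\leq 3\sum_{i}|B_{i}|$ can be illustrated with a short sketch. It assumes each cluster is given as a pair of disjoint vertex sets $(D_{i},B_{i})$ ; the function name and the toy instance are hypothetical.

```python
def split_core_and_noise(parts):
    """Given pairs (D_i, B_i), return (D, B) with D the union of the D_i
    satisfying |D_i| >= 2|B_i| and B the set of all remaining vertices."""
    D = set()
    V = set()
    for D_i, B_i in parts:
        V |= D_i | B_i
        if len(D_i) >= 2 * len(B_i):
            D |= D_i
    return D, V - D

# Toy instance: the second cluster is dominated by noise, so its D_i is discarded.
parts = [
    ({0, 1, 2, 3, 4, 5}, {6}),
    ({7, 8}, {9, 10}),
    ({11, 12, 13, 14}, set()),
]
D, B = split_core_and_noise(parts)
noise = sum(len(B_i) for _, B_i in parts)
```

A discarded $D_{i}$ has fewer than $2|B_{i}|$ vertices, so together with $B_{i}$ it contributes fewer than $3|B_{i}|$ vertices to $B$ , which is where the factor $3$ comes from.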

Let $\ell$ be an integer that will be specified later. For any $s\in D$ , we let $\tau_{s}:={\textbf{p}^{\prime}_{s}}^{(\ell)}$ be the probability distribution of an $\ell$ step random walk starting from $s$ in the induced Markov chain $\mathcal{M}^{\prime}$ . Let $\Gamma_{s}$ be the stopping rule from $\textbf{1}_{s}$ to $\tau_{s}$ which is obtained by stopping the random walk that starts at $s$ in $\mathcal{M}$ as soon as it has taken $\ell$ steps in $\mathcal{M}^{\prime}$ . Let $\tilde{B}\subseteq D$ be the set guaranteed by Lemma 5.7 such that $|\tilde{B}|\leq|B|\leq 3\varepsilon n$ and for any $s\in D\setminus\tilde{B}$ ,

[TABLE]

Now we set $a_{\ref{thm:rw_perturbed}}=\frac{a_{\ref{thm:rw_clusterable}}}{120}$ and thus $\phi_{\textrm{out}}\leq\frac{a_{\ref{thm:rw_clusterable}}\varepsilon\gamma^{4}\phi_{\textrm{in}}^{2}}{120k^{3}\log n}$ . We then apply Theorem 5.1 to $G_{\mathcal{M}^{\prime}}$ with $(\phi_{\textrm{in}},3\phi_{\textrm{out}})$ -clusters $D_{i}$ and $\alpha=3\sqrt{\varepsilon},\beta=3\sqrt{\varepsilon}$ , $\xi=\frac{\gamma}{6}$ , to obtain that for any $D_{j}$ with $|D_{j}|\geq 3\sqrt{\varepsilon}n\geq 3\sqrt{\varepsilon}|D|$ , there exists a set $D_{j}^{\prime}$ with $|D_{j}^{\prime}|\geq(1-3\sqrt{\varepsilon})|D_{j}|$ such that for any $s\in D_{j}^{\prime}$ and $\ell=\frac{20\log n}{\phi_{\textrm{in}}^{2}}$ , it holds that $\lVert\tau_{s}-\mathbf{\mathcal{U}}_{D_{j}}\rVert_{\textrm{TV}}\leq\frac{\gamma}{6}$ . This implies that

[TABLE]

Now we set $\widehat{D}_{j}:=D_{j}^{\prime}\setminus\tilde{B}$ . Then it is guaranteed that for any $j$ with $|D_{j}|\geq 3\sqrt{\varepsilon}n$ , $|\widehat{D}_{j}|\geq(1-3\sqrt{\varepsilon})|D_{j}|-3\varepsilon n\geq(1-4\sqrt{\varepsilon})|D_{j}|$ . Thus, for any $s\in\widehat{D}_{j}$ , both inequalities (5) and (6) hold.

Now let us consider an arbitrary $s\in\widehat{D}_{j}$ . Let $\tau=\tau_{s}$ and $\sigma=\textbf{1}_{s}$ . By the precondition of the Theorem, we have that $t=\frac{120\log n}{\gamma\phi_{\textrm{in}}^{2}}=\frac{6\ell}{\gamma}$ . We further recall that ${\textbf{a}}_{s}^{t}$ denotes the distribution of a uniform average walk of length $t$ with initial distribution $\sigma$ in the original chain $\mathcal{M}$ . By applying Lemma 5.5 with $\sigma^{(t)}={\textbf{a}}_{s}^{t}$ and distribution $\tau$ , we obtain that for any $U\subset V$ ,

[TABLE]

where $\tau^{m}$ denotes the distribution of an $m$ step random walk on $G$ with initial distribution $\tau$ , that is $\tau^{m}=\tau\mathbf{P}^{m}$ . (Here we slightly abuse the notation $\tau$ and use it to denote the distribution on $V$ by adding zero coordinates corresponding to vertices in $V\setminus D$ ). This further implies that for any set $C_{j}$ and any $U\subseteq V$ ,

[TABLE]

Therefore,

[TABLE]

where the last inequality follows from inequality (5). Now recall that $\mathbf{P}=\frac{\textbf{I}+d^{-1}\textbf{A}}{2}$ denotes the transition probability matrix of the random walk. We will show the following claim.

Claim 5.9.

For any $0\leq m\leq t-1$ , it holds that $\lVert\mathbf{\mathcal{U}}_{C_{j}}\mathbf{P}^{m}-\mathbf{\mathcal{U}}_{C_{j}}\rVert_{\textrm{TV}}\leq\frac{\gamma}{3}.$

Assuming that the above claim holds, we have that for any $0\leq m\leq t-1$ ,

[TABLE]

where the last inequality follows from Ineq. (6) and Claim 5.9. This, together with inequality (7), gives that

[TABLE]

This will then finish the proof of the theorem.

Now we give the proof of Claim 5.9.

Proof of Claim 5.9.

For notational simplicity, we let $C=C_{j}$ . We write $\mathbf{P}=\sum_{i=1}^{n}\eta_{i}\textbf{v}_{i}\textbf{v}_{i}^{T}$ , where $\eta_{i}:=1-\frac{\lambda_{i}}{2}$ and $\textbf{v}_{i}$ ( $1\leq i\leq n$ ) denote the $i$ -th eigenvalue of $\mathbf{P}$ and the corresponding eigenvector, respectively. Let $\mathbf{\mathcal{U}}_{C}=\sum_{i}\alpha_{i}\textbf{v}_{i}$ . Note that $\sum_{i=1}^{n}\alpha_{i}^{2}=\lVert\mathbf{\mathcal{U}}_{C}\rVert_{2}^{2}=\frac{1}{|C|}$ .

Note that

[TABLE]

which gives that $1-|C|\cdot\mathbf{\mathcal{U}}_{C}\mathbf{P}\mathbf{\mathcal{U}}_{C}^{T}\leq\frac{\phi_{\textrm{out}}}{2}$ . Thus, $1-|C|\sum_{i}\eta_{i}\alpha_{i}^{2}\leq\frac{\phi_{\textrm{out}}}{2}$ , or equivalently, $\sum_{i}\eta_{i}\alpha_{i}^{2}\geq\frac{1-\phi_{\textrm{out}}/2}{|C|}.$

Let $H=\{i:\eta_{i}\geq 1-\frac{x\phi_{\textrm{out}}}{2}\}$ , where $x=\frac{8}{\gamma^{2}}$ . Then we have that $\sum_{i\in H}\alpha_{i}^{2}+(1-\frac{x\phi_{\textrm{out}}}{2})\sum_{i\notin H}\alpha_{i}^{2}\geq\frac{1-\phi_{\textrm{out}}/2}{|C|}.$ Thus, $\sum_{i\in H}\alpha_{i}^{2}+(1-\frac{x\phi_{\textrm{out}}}{2})(\frac{1}{|C|}-\sum_{i\in H}\alpha_{i}^{2})\geq\frac{1-\phi_{\textrm{out}}/2}{|C|},$ which gives that

[TABLE]

Now we have that

[TABLE]

where we used our choice of parameters which satisfy that $t\phi_{\textrm{out}}\leq\gamma^{3}/16$ and $x=\frac{8}{\gamma^{2}}$ .

On the other hand, if we let $\textbf{D}_{C}$ denote the diagonal matrix such that $\textbf{D}_{C}(u,u)=1$ if $u\in C$ and $0$ otherwise, then by Proposition 2.5 in [ST13], it holds that for any $m\geq 0$ ,

[TABLE]

This gives that

[TABLE]

Finally, by the above calculations, we have that

[TABLE]

This finishes the proof of the Claim.

This finishes the proof of Theorem 1.5.

6 Conclusions

We gave the first robust clustering oracle and local filter for reconstructing the cluster structure of bounded degree graphs. Both algorithms run in sublinear time. To design and analyze our algorithms, we formalized and proved a new behavior of random walks in a noisy clusterable graph: a random walk of appropriately chosen length from a typical vertex in a large cluster of the clusterable part will mix well in the corresponding cluster, which might be of independent interest.

It is an interesting open question to design a local reconstruction algorithm that outputs a clusterable graph with a better cluster-quality guarantee, in particular one that removes the $\Theta(\log n)$ gap between the inner conductances of the original graph and the corrected graph in our current result. In the property testing setting, such a gap was successfully closed, both for testing expansion ([CS10] vs. [KS11, NS10]) and for testing $k$ -clusterability ([CPS15] vs. [CKK+18]). However, in the local reconstruction setting, we do not even know how to remove such a logarithmic gap for reconstructing noisy expander graphs (i.e., $k=1$ ). As noted in [KPS13], for the case $k=1$ , one already needs more refined definitions of strong/weak vertices and much stronger results about random walks in noisy expander graphs. Removing the logarithmic gap from our result for locally reconstructing cluster structure for general $k\geq 1$ can be as hard, if not harder. A similar question can be asked about removing the $\Theta(\log n)$ gap between the inner and outer conductance of the input instance of our robust clustering oracle. As we mentioned before, there is evidence in [CKK+18] showing that this is difficult (for distribution distance based algorithms).

Acknowledgements.

We are thankful to anonymous reviewers of FOCS 2018 and STOC 2019 for valuable comments.

Appendix A Proof of Lemma 3.2

Proof of Lemma 3.2.

First, note that if there are more than $\sqrt{\kappa}|C|$ vertices $v$ in $C$ satisfying ${\textbf{a}}_{u}^{t}(v)\leq(1-\sqrt{\kappa})/|C|$ , then $\lVert{\textbf{a}}_{u}^{t}-\mathbf{\mathcal{U}}_{C}\rVert_{TV}>\sqrt{\kappa}|C|\cdot\sqrt{\kappa}/|C|\geq\kappa$ , which contradicts the fact that $u$ is strong with respect to $C$ .

Second, by the definition of the set $S_{u}^{\theta_{0}}$ and the fact that ${\theta_{0}}\leq 1/2$ , there can be at most $2\sqrt{n}$ vertices in $V\setminus S_{u}^{\theta_{0}}$ , and thus there are at least $(1-\sqrt{\kappa})|C|-2\sqrt{n}$ vertices $w\in S_{u}^{\theta_{0}}\cap C$ such that ${\textbf{a}}_{u}^{t}(w)\geq\frac{1-\sqrt{\kappa}}{|C|}$ . Thus

[TABLE]

where in the second inequality we used the fact that $|C|\geq 3\sqrt{\varepsilon^{\prime}}n=3\sqrt{\frac{6\varepsilon}{\phi}}n>\frac{8\sqrt{n}}{\sqrt{\kappa}}$ as $\varepsilon=\Omega(\frac{\phi}{{n}})$ .

Finally, since $u$ is strong with respect to $C$ , there are at least $(1-\sqrt{\kappa})|C|-2\sqrt{n}$ vertices $w\in S_{u}^{\theta_{0}}\cap C$ such that ${\textbf{a}}_{u}^{t}(w)\geq\frac{1-\sqrt{\kappa}}{|C|}$ . The same is true for $v$ . Thus, there are at least $(1-2\sqrt{\kappa})|C|-4\sqrt{n}$ vertices $w\in S_{u}^{\theta_{0}}\cap S_{v}^{\theta_{0}}\cap C$ such that ${\textbf{a}}_{u}^{t}(w),{\textbf{a}}_{v}^{t}(w)\geq\frac{1-\sqrt{\kappa}}{|C|}$ . Again, by the fact that $|C|>\frac{8\sqrt{n}}{\sqrt{\kappa}}$ , we have that

[TABLE]

This finishes the proof of the Lemma.

Appendix B Description of the Algorithm EstimateRCP

In the algorithm, $C$ is a sufficiently large constant.

[TABLE]

Appendix C Further Guarantees on the Locally Reconstructed Graph

In the following, we show that by sacrificing the inner conductance quality, we can also find a clustering of the reconstructed graph $G^{\prime}$ with small outer conductance.

Lemma C.1.

Let $\phi^{*}=\frac{a_{\ref{lemma:conductance}}\varepsilon\phi}{k^{4}\log n}$ . If $G$ is an $\varepsilon$ -perturbation of a $(k,\phi,\frac{a_{\ref{thm:rw_perturbed}}\varepsilon\kappa^{4}\phi}{3k^{3}\log n})$ -clusterable graph, then the resulting graph $G^{\prime}$ from the local reconstruction algorithm is $(k,\frac{\nu^{k}}{6^{k}}\phi^{*},\min\{k\nu,1\})$ -clusterable, for any $0\leq\nu\leq 1$ .

Proof.

We start with the $(k,\phi^{*},1)$ -clustering of $G^{\prime}$ that is guaranteed from Lemma 4.3. Let $C_{1},\cdots,C_{h}$ be a partition satisfying that $\phi(G^{\prime}[C_{i}])\geq\phi^{*}$ . Let $\nu\in[0,1]$ . We next carefully merge some of these clusters so that each part of the final partition will have both inner conductance at least $\frac{\nu^{k}}{6^{k}}\phi^{*}$ and outer conductance at most $\min\{k\nu,1\}$ .

If there exists $1\leq i\neq j\leq h$ such that $|C_{i}|\leq|C_{j}|$ with $|E^{\prime}(C_{i},C_{j})|\geq\nu d|C_{i}|$ , then we merge $C_{i}$ and $C_{j}$ to obtain a new cluster $C:=C_{i}\cup C_{j}$ . We repeat until the condition is violated.

Note that this process always terminates as each merge decreases the number of clusters by $1$ . Furthermore, note that after termination, each cluster has outer conductance at most $\min\{1,k\nu\}$ by construction. Now we show that in each iteration, the merged cluster $C=C_{i}\cup C_{j}$ still has large inner conductance. Let $S\subset C$ with $|S|\leq\frac{|C|}{2}$ . Let $S_{i}=S\cap C_{i}$ and $S_{j}=S\cap C_{j}$ . Note that it cannot happen simultaneously that $|S_{i}|>\frac{|C_{i}|}{2}$ and $|S_{j}|>\frac{|C_{j}|}{2}$ . Now we have the following cases.

•

If both $|S_{i}|\leq\frac{|C_{i}|}{2}$ and $|S_{j}|\leq\frac{|C_{j}|}{2}$ , then

[TABLE]

•

If $|S_{j}|>\frac{|C_{j}|}{2}$ , then $|S|\leq|C_{i}|+|S_{j}|\leq|C_{j}|+|S_{j}|<3|S_{j}|$ .

If $|S_{j}|\geq(1-\frac{\nu}{2})|C_{j}|$ , then $|C_{i}|\geq\frac{2}{3}|C_{j}|$ as otherwise $|C|\leq\frac{5}{3}|C_{j}|$ and $|S|\geq|S_{j}|>\frac{|C|}{2}$ , a contradiction. Then $|S_{i}|\leq\frac{\nu}{2}|C_{j}|\leq\frac{\nu}{2}\frac{3}{2}|C_{i}|=\frac{3\nu}{4}|C_{i}|$ . Thus there will be at least $\frac{d\nu}{4}|C_{i}|$ edges between $S_{j}$ and $C_{i}\setminus S_{i}$ . Thus $\phi_{G[C]}(S)\geq\frac{|E^{\prime}(S_{j},C_{i})|}{d|S|}\geq\frac{\frac{d\nu}{4}|C_{i}|}{3d|S_{j}|}\geq\frac{\frac{d\nu}{4}\frac{2}{3}|C_{j}|}{3d|C_{j}|}=\frac{\nu}{18}$ .

If $|S_{j}|\leq(1-\frac{\nu}{2})|C_{j}|$ , then $|C_{j}\setminus S_{j}|\geq\frac{\nu}{2}|C_{j}|\geq\frac{\nu}{2(1-\frac{\nu}{2})}|S_{j}|$ . Therefore, $\phi_{G[C]}(S)\geq\frac{|E^{\prime}(S_{j},C_{j}\setminus S_{j})|}{d|S|}\geq\frac{\phi^{*}d|C_{j}\setminus S_{j}|}{3d|S_{j}|}>\frac{\phi^{*}\nu}{6}$ .

•

If $|S_{i}|>\frac{|C_{i}|}{2}$ , then it must hold that $|S_{j}|<\frac{|C_{j}|}{2}$ .

If $|S_{i}|<(1-\frac{\nu}{2})|C_{i}|$ , then $\frac{|C_{i}|}{2}\geq|C_{i}\setminus S_{i}|\geq\frac{\nu}{2}|C_{i}|$ . Thus $\phi_{G[C]}(S)\geq\frac{|E^{\prime}(S_{i},C_{i}\setminus S_{i})|+|E^{\prime}(S_{j},C_{j}\setminus S_{j})|}{d(|S_{i}|+|S_{j}|)}\geq\min\{\frac{\phi^{*}d|C_{i}\setminus S_{i}|}{d|S_{i}|},\frac{\phi^{*}d|S_{j}|}{d|S_{j}|}\}=\min\{\frac{\nu\phi^{*}}{2},\phi^{*}\}=\frac{\nu\phi^{*}}{2}$ .

If $|S_{i}|\geq(1-\frac{\nu}{2})|C_{i}|$ , then $|E^{\prime}(S_{i},C_{j})|\geq\frac{d\nu}{2}|C_{i}|$ . If $|E^{\prime}(S_{i},S_{j})|\geq\frac{1}{2}|E^{\prime}(S_{i},C_{j})|$ , then $|S_{j}|\geq\frac{\nu}{4}|C_{i}|$ , then $\phi_{G[C]}(S)\geq\frac{|E^{\prime}(S_{j},C_{j}\setminus S_{j})|}{d|S|}\geq\frac{\phi^{*}d|S_{j}|}{d(|S_{j}|+|C_{i}|)}\geq\frac{\phi^{*}\nu}{5}$ . Otherwise, $|E^{\prime}(S_{i},S_{j})|<\frac{1}{2}|E^{\prime}(S_{i},C_{j})|$ , then $|E^{\prime}(S_{i},C_{j}\setminus S_{j})|\geq\frac{1}{2}|E^{\prime}(S_{i},C_{j})|\geq\frac{d\nu}{4}|C_{i}|$ . Thus $\phi_{G[C]}(S)\geq\frac{|E^{\prime}(S_{i},C_{j}\setminus S_{j})|+|E^{\prime}(S_{j},C_{j}\setminus S_{j})|}{d(|S_{i}|+|S_{j}|)}\geq\min\{\frac{\frac{d\nu}{4}|C_{i}|}{d|S_{i}|},\frac{\phi^{*}d|S_{j}|}{d|S_{j}|}\}\geq\min\{\frac{\nu}{4},\phi^{*}\}$ .

From the above analysis, we know that if both $\phi(G[C_{i}])\geq\phi^{*}$ and $\phi(G[C_{j}])\geq\phi^{*}$ , then after merging $C_{i}$ and $C_{j}$ , the resulting cluster $C$ has inner conductance at least $\frac{\nu\phi^{*}}{6}$ . Since there will be at most $k$ iterations (or merges), we know that in the final partition $\mathcal{P^{\prime}}$ , each part has outer conductance at most $\min\{k\nu,1\}$ and inner conductance at least $\frac{\nu^{k}\phi^{*}}{6^{k}}=\frac{a_{\ref{lemma:conductance}}\nu^{k}}{6^{k}k^{4}}\frac{\varepsilon\phi}{\log n}$ . This proves the statement of the lemma.
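The merging step used in the proof of Lemma C.1 can be sketched as follows. This is a minimal illustration assuming an unweighted graph given as an edge list and a degree bound $d$ ; the function name `merge_clusters` and the toy instance are hypothetical and not part of the paper.

```python
def merge_clusters(edges, parts, d, nu):
    """Repeatedly merge C_i into C_j whenever |C_i| <= |C_j| and
    |E(C_i, C_j)| >= nu * d * |C_i|, until no such pair exists."""
    parts = [set(p) for p in parts]
    merged = True
    while merged:
        merged = False
        for i in range(len(parts)):
            for j in range(len(parts)):
                if i == j or len(parts[i]) > len(parts[j]):
                    continue
                cross = sum(1 for u, v in edges
                            if (u in parts[i] and v in parts[j])
                            or (u in parts[j] and v in parts[i]))
                if cross >= nu * d * len(parts[i]):
                    parts[j] |= parts[i]
                    del parts[i]
                    merged = True
                    break
            if merged:
                break  # restart the scan on the updated partition
    return parts

# Toy instance with d = 4: C_0 and C_1 share three edges, C_1 and C_2 share one.
edges = [(0, 1), (2, 3), (0, 2), (1, 3), (0, 3), (3, 4),
         (4, 5), (4, 6), (4, 7), (5, 6), (5, 7), (6, 7)]
parts = [{0, 1}, {2, 3}, {4, 5, 6, 7}]
result = merge_clusters(edges, parts, d=4, nu=0.25)
```

With $\nu=\frac{1}{4}$ the first two clusters are merged (three crossing edges against a threshold of $\nu d|C_{i}|=2$ ), while the merged cluster and the 4-clique stay separate (one crossing edge against a threshold of $4$ ). Each merge reduces the number of parts by one, which is the termination argument used in the proof.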

Bibliography

[ABJ18] Nir Ailon, Anup Bhattacharya, and Ragesh Jaiswal. Approximate correlation clustering using same-cluster queries. In Latin American Symposium on Theoretical Informatics, pages 14–27. Springer, 2018.
[ABJK18] Nir Ailon, Anup Bhattacharya, Ragesh Jaiswal, and Amit Kumar. Approximate clustering with same-cluster queries. In 9th Innovations in Theoretical Computer Science Conference, ITCS 2018, January 11-14, 2018, Cambridge, MA, USA, pages 40:1–40:21, 2018.
[ACCL08] Nir Ailon, Bernard Chazelle, Seshadhri Comandur, and Ding Liu. Property-preserving data reconstruction. Algorithmica, 51(2):160–182, 2008.
[ACL06] Reid Andersen, Fan Chung, and Kevin Lang. Local graph partitioning using PageRank vectors. In 47th Annual IEEE Symposium on Foundations of Computer Science (FOCS'06), pages 475–486. IEEE, 2006.
[AKBD16] Hassan Ashtiani, Shrinu Kushagra, and Shai Ben-David. Clustering with same-cluster queries. In Advances in Neural Information Processing Systems, pages 3216–3224, 2016.
[AOPT16] Reid Andersen, Shayan Oveis Gharan, Yuval Peres, and Luca Trevisan. Almost optimal local graph clustering using evolving sets. Journal of the ACM (JACM), 63(2):15, 2016.
[AP09] Reid Andersen and Yuval Peres. Finding sparse cuts locally using evolving sets. In Proceedings of the Forty-First Annual ACM Symposium on Theory of Computing, pages 235–244. ACM, 2009.
[ARVX12] Noga Alon, Ronitt Rubinfeld, Shai Vardi, and Ning Xie. Space-efficient local computation algorithms. In Proceedings of the Twenty-Third Annual ACM-SIAM Symposium on Discrete Algorithms, pages 1132–1139. Society for Industrial and Applied Mathematics, 2012.
