Robust Clustering Oracle and Local Reconstructor of Cluster Structure of Graphs
Pan Peng

Department of Computer Science, University of Sheffield, Sheffield, U.K. Email: [email protected].
Due to the massive size of modern network data, local algorithms that run in sublinear time for analyzing the cluster structure of the graph are receiving growing interest. Two typical examples are local graph clustering algorithms that find a cluster from a seed node with running time proportional to the size of the output set, and clusterability testing algorithms that decide if a graph can be partitioned into a few clusters in the framework of property testing.
In this work, we develop sublinear time algorithms for analyzing the cluster structure of graphs with noisy partial information. By using conductance-based definitions for measuring the quality of clusters and the cluster structure, we formalize a definition of noisy clusterable graphs with bounded maximum degree. The algorithm is given query access to the adjacency list of such a graph. We then formalize the notion of a robust clustering oracle for a noisy clusterable graph, and give an algorithm that builds such an oracle in sublinear time, which can be further used to support typical queries (e.g., IsOutlier(), SameCluster()) regarding the cluster structure of the graph in sublinear time. All the answers are consistent with a partition of the vertex set in which all but a small fraction of vertices belong to some good cluster. We also give a local reconstructor for a noisy clusterable graph that provides query access, in sublinear time, to a reconstructed graph that is guaranteed to be clusterable. All the query answers are consistent with a clusterable graph which is guaranteed to be close to the original graph.
To obtain our results, we give new analysis of the behavior of random walks on a noisy clusterable graph, which consists of a large subset that induces a clusterable graph and a small unknown subgraph (the noise). We show that a random walk of appropriately chosen length from a typical vertex in a large cluster of the clusterable part will mix well in the corresponding cluster. Using this we are able to distinguish vertices from the clusterable part from those in the noisy part.
1 Introduction
Graph clustering is a fundamental task arising in many domains, including computer science, social science, network analysis and statistics. Given a graph, the task is to group the vertices into reasonably good clusters, where vertices inside the same cluster are well-connected to each other, and any two different clusters are well-separated. Such clusters convey valuable information about large graphs, and have concrete applications in recommendation systems, search engines, network routing and many others (see e.g., surveys [Sch07, POM09, For10, New12]). Many efficient global clustering algorithms that run in polynomial time have been proposed for analyzing the structure of graphs, where the goal is to find the overall cluster structure of a graph. Almost all such algorithms need to at least read the whole input graph and thus run in at least linear time. Actually, even just outputting all the clusters will require $\Omega(n)$ time, where $n$ is the number of vertices of the graph. These algorithms, though considered efficient in classical algorithm design, are becoming impractical (and sometimes even impossible) to use for processing and analyzing modern very large networks/graphs (e.g., WWW and social networks).
Therefore, local algorithms that run in sublinear time for analyzing the cluster structure of the graph are receiving growing interest. Such algorithms are typically assumed to be able to explore the input graph by performing appropriate queries, e.g., querying the degree or a neighbor of any node. There have been two main frameworks for designing sublinear algorithms for graph clustering, if one uses the well-motivated notion of conductance (see below) to measure the quality of clusters. In the first one, called local graph clustering, the goal is to find a cluster from a specified vertex with running time that is bounded in terms of the size of the output set (and with a weak dependence on $n$) (see e.g., [ST13, ACL06, AP09, OT12, AOPT16, ZLM13, OZ14]). If the target cluster is much smaller than the graph, then the running time of the resulting algorithm will be sublinear in the input size. In the second one, called testing cluster structure in the framework of property testing, the goal is to distinguish whether an input graph has a typical cluster structure or is far from having one (see [CPS15, CKK+18] and more discussions below). Such algorithms make decisions on the global cluster structure of the input graph by sampling vertices and locally exploring a small portion of the graph, and they can serve as a preliminary step before learning the cluster structure.
In this work, we study local and sublinear algorithms for analyzing the cluster structure of graphs that may contain noise and/or outliers. In many real applications, due to external noise or errors, the network data set may fail to have the desired property (here, the cluster structure), while it might still be close to having this property. That is, the graph under our consideration is some kind of perturbation of a clusterable graph, or a noisy clusterable graph: a graph is first chosen from some class of clusterable graphs with an underlying but unknown partition, and then some noise and/or outliers are introduced by some adversary or in some random way. This is a relaxation of a common assumption for many existing clustering algorithms that the input graph is simply well clusterable. We would like to very efficiently process such a noisy clusterable graph and extract useful information regarding its cluster structure. Slightly more precisely, we study two types of sublinear algorithms for analyzing the cluster structure of graphs with noisy partial information.
The first type of algorithm is driven by the following natural question: Given a noisy clusterable graph, can we build an oracle (or implicit representation) in sublinear time that can support typical queries regarding the cluster structure of the graph in sublinear time? For example, we would like to query “Is a vertex $v$ a noise/outlier?”. If the answer is “No”, we would further like to know “Which cluster does $v$ belong to?”, and “Do $u$ and $v$ belong to the same cluster?”, given that both vertices are not outliers. We require that all the query answers be consistent, e.g., if $u, v$ are reported to belong to the same cluster and $v, w$ are reported to belong to the same cluster, then $u, w$ will also be reported to belong to the same cluster. Furthermore, we would like to minimize the number of vertices for which the oracle returns the “wrong” answers, in the sense that the output partition of the algorithm should be close to an underlying maximal good clustering of the graph. We call such an oracle a robust clustering oracle. Such oracles might already be interesting for real-world applications. For example, quickly identifying outliers might be valuable in road networks and medical data. Sometimes, we only want the cluster information of a small group of vertices and do not care about other parts of the graph. Furthermore, it is desirable to work on-the-fly on clean data after removing a small fraction of outliers. Besides these real-world applications, such oracles might be given as input to other clustering algorithms that are equipped with the power of making the above-mentioned clustering queries (see e.g., [MS17b, MS17a, AKBD16, ABJK18, ABJ18]).
Our second type of algorithm is motivated by a closely related question: Given a noisy clusterable graph, can we fix it by minimally modifying the original graph, and provide query access to the reconstructed clusterable graph in sublinear time? We address this question in the online reconstruction framework introduced by [ACCL08]. In this framework (for graphs), given a property $\mathcal{P}$ and query access to a graph $G$ that is close to having $\mathcal{P}$, we want to output a graph $G'$ such that $G'$ has the property $\mathcal{P}$ and $G$ is modified minimally to get $G'$. Furthermore, we would like to output $G'$ in a local and consistent way, providing query access to $G'$ while making as few queries to the input graph $G$ as possible. The corresponding algorithm will be called a local reconstructor or local filter for property $\mathcal{P}$ [ACCL08, SS10, AT10]. The natural application of such local reconstructors is when only a small portion of the corrected graph is needed or when we want to make use of the graph in a distributed manner. (Note that in many applications, queries are made to a large graph which is assumed to exhibit some structural property.) Here, we focus on designing a local filter for the cluster structure of graphs and providing consistent query access to a clusterable graph. In practice, such algorithms might be used for quickly recommending products to users even if there is some noise in the data.
In this work, we give both a sublinear robust clustering oracle and a local reconstructor for the cluster structure of graphs. We now give basic definitions of clusters and (noisy) clusterable graphs, formalize our algorithmic problems, state our main results and sketch our technical ideas.
1.1 Basic Definitions
Conductance based clustering.
Following a recent line of research on graph clustering (e.g., [OT14, CPS15, PSZ17, DPRS19], which were built upon [KVV04]), we will use conductance-based definitions for measuring the quality of clusters and the cluster structure of graphs. In this paper, we will focus on undirected graphs with bounded maximum degree. We call an undirected graph $G = (V, E)$ a $d$-bounded graph if its maximum degree is upper bounded by some parameter $d$, which is always assumed to be some sufficiently large constant (at least ). For any two subsets $S, T \subseteq V$, we let $E(S, T)$ denote the set of edges with one endpoint in $S$ and the other endpoint in $T$. The conductance of a set $S$ in $G$ is defined to be the ratio between the number of edges crossing $S$ and its complement and the maximum possible number of edges incident to $S$, that is, $\phi_G(S) = |E(S, V \setminus S)| / (d|S|)$. The conductance of the graph $G$ is defined to be the minimum conductance over all sets $S$ of size at most $|V|/2$, that is, $\phi(G) = \min_{S \subseteq V,\, 1 \le |S| \le |V|/2} \phi_G(S)$. For convenience, for the singleton graph (that consists of a single vertex with no edges) we define its inner conductance to be .
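Given a vertex set $S \subseteq V$, we let $G[S]$ denote the subgraph induced by the vertices in $S$. In the following, we will refer to $\phi_G(S)$ and $\phi(G[S])$ as the outer conductance and inner conductance of $S$, respectively. Given two parameters $\phi_{\mathrm{in}}$ and $\phi_{\mathrm{out}}$, we call a set $S$ a $(\phi_{\mathrm{in}}, \phi_{\mathrm{out}})$-cluster if
$$\phi(G[S]) \ge \phi_{\mathrm{in}} \quad \text{and} \quad \phi_G(S) \le \phi_{\mathrm{out}}.$$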
For a good cluster $S$, we expect $\phi(G[S])$ to be large and $\phi_G(S)$ to be small. In particular, if $S = V$ and $\phi(G) \ge \phi$ for some constant $\phi$, then we call the graph $G$ a $\phi$-expander, which by itself is a good cluster and has been extensively studied in theoretical computer science (see e.g., [HLW06]). It is useful to note that . When $G$ is clear from the context, we omit the subscript $G$ from $\phi_G(S)$. A $k$-partition of a graph $G = (V, E)$ is a partition of $V$ into $k$ subsets $C_1, \dots, C_k$, such that $C_i \cap C_j = \emptyset$ for $i \ne j$ and $\bigcup_{i} C_i = V$. We have the following definition of clusterable graphs, which characterizes graphs with typical cluster structure (see e.g., [OT14]).
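Definition 1.1.
Given parameters $d, k, \phi_{\mathrm{in}}, \phi_{\mathrm{out}}$, we call a $k$-partition $C_1, \dots, C_k$ of a $d$-bounded graph $G$ a $(k, \phi_{\mathrm{in}}, \phi_{\mathrm{out}})$-clustering if for each $i$, $\phi(G[C_i]) \ge \phi_{\mathrm{in}}$ and $\phi_G(C_i) \le \phi_{\mathrm{out}}$.
A $d$-bounded graph $G$ is called $(k, \phi_{\mathrm{in}}, \phi_{\mathrm{out}})$-clusterable if $G$ has an $(h, \phi_{\mathrm{in}}, \phi_{\mathrm{out}})$-clustering for some $h \le k$.
Note that in our definition, a $(k, \phi_{\mathrm{in}}, \phi_{\mathrm{out}})$-clusterable graph may contain fewer than $k$ clusters, and $(1, \phi_{\mathrm{in}}, \phi_{\mathrm{out}})$-clusterable graphs are equivalent to $\phi_{\mathrm{in}}$-expanders.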
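To make Definition 1.1 concrete, the following is a minimal brute-force sketch for tiny graphs only. It assumes the normalization $\phi_G(S) = |E(S, V \setminus S)| / (d|S|)$ and the minimum over sets of size at most half for the inner conductance; the function names and the value used for a singleton cluster are illustrative assumptions, not taken from the paper. The enumeration is exponential, so this is a check of the definition rather than an algorithm.

```python
from itertools import combinations


def outer_conductance(adj, S, d):
    """phi_G(S) = |E(S, V \\ S)| / (d * |S|), assuming the d-bounded normalization."""
    S = set(S)
    crossing = sum(1 for u in S for v in adj[u] if v not in S)
    return crossing / (d * len(S))


def inner_conductance(adj, C, d):
    """phi(G[C]) by brute force over all subsets of C with at most |C|/2 vertices."""
    C = set(C)
    if len(C) == 1:
        return 1.0  # ASSUMPTION: convention chosen here for a singleton cluster
    # Restrict the adjacency lists to the induced subgraph G[C].
    induced = {u: [v for v in adj[u] if v in C] for u in C}
    return min(
        outer_conductance(induced, S, d)
        for size in range(1, len(C) // 2 + 1)
        for S in combinations(sorted(C), size)
    )


def is_clustering(adj, parts, d, phi_in, phi_out):
    """Check the two conditions of Definition 1.1 for every part."""
    return all(
        inner_conductance(adj, C, d) >= phi_in
        and outer_conductance(adj, C, d) <= phi_out
        for C in parts
    )


# Toy instance: two triangles joined by the single edge (2, 3), with d = 3.
two_triangles = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2, 4, 5], 4: [3, 5], 5: [3, 4]}
parts = [{0, 1, 2}, {3, 4, 5}]
```

On the toy instance each triangle has one crossing edge out of at most $3 \cdot 3$ possible incidences, so whether `parts` qualifies depends on where `phi_out` sits relative to $1/9$.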
Clusterable graphs with modeling noise.
We assume that the input graph to the algorithm is generated from the family of all $(k, \phi_{\mathrm{in}}, \phi_{\mathrm{out}})$-clusterable graphs and then modified by an adversary in some manner. We have the following definition.
Definition 1.2.
(Clusterable Graphs with Modeling Noise or Noisy Clusterable Graphs) In this model, the adversary first chooses an arbitrary graph $G$ from the family of all $(k, \phi_{\mathrm{in}}, \phi_{\mathrm{out}})$-clusterable graphs with maximum degree upper bounded by $d$. Then the adversary may do the following:
1. Choose an arbitrary $(h, \phi_{\mathrm{in}}, \phi_{\mathrm{out}})$-clustering $C_1, \dots, C_h$ of $G$ for some $h \le k$.
2. Insert and/or delete at most edges (noise) within the clusters $C_i$, $1 \le i \le h$, while preserving the degree bound.
We call the resulting graph an $\varepsilon$-perturbation of $G$ with respect to the $h$-partition $C_1, \dots, C_h$.
Equivalently, a graph is called an $\varepsilon$-perturbation of a $(k, \phi_{\mathrm{in}}, \phi_{\mathrm{out}})$-clusterable graph if there is a partition of $V$ with at most $k$ parts (called clusters), such that one can insert/delete at most intra-cluster edges to make it a $(k, \phi_{\mathrm{in}}, \phi_{\mathrm{out}})$-clusterable graph. For simplicity, in the above definition, we only allow the adversary to perturb the edges inside the clusters, while our algorithm can actually be extended to work for the case that the adversary is also allowed to perturb inter-cluster edges, up to a very limited extent (footnote 2). This definition generalizes the notion of noisy expander graphs studied by Kale, Peres, and Seshadhri [KPS13], which corresponds to $k = 1$ in our problem. In their setting, the adversary first chooses a -expander and then modifies it by inserting/deleting fraction of edges in the graph.
Footnote 2: More precisely, the adversary can be allowed to perturb a fraction of inter-cluster edges: this essentially can then be reduced to the case that only intra-cluster perturbations are allowed by re-scaling the conductance values by a constant factor, i.e., one can view that the adversary first chooses a -clusterable graph and then perturbs its intra-cluster edges.
1.2 Problem Formalizations and Main Results
Now we formalize our algorithmic problems and present our main results. For a $d$-bounded graph $G$, we will assume the algorithm is given query access to the adjacency list of $G$, that is, in constant time we can query the $i$-th neighbor of any vertex $v$.
Robust clustering oracle.
Given query access to the adjacency list of a $d$-bounded graph $G$ that is promised to be an $\varepsilon$-perturbation of a $(k, \phi_{\mathrm{in}}, \phi_{\mathrm{out}})$-clusterable graph, we are interested in constructing an implicit representation, called a robust clustering oracle, of $G$ in sublinear time such that typical queries regarding the cluster structure of $G$ can be answered as quickly as possible (also in sublinear time). More precisely, the oracle should support the following types of clustering queries:
1) IsOutlier($v$): Is a vertex $v$ a noise/outlier?
Intuitively, a vertex that does not belong to any good cluster should be reported as noise or outlier. For any non-outlier vertices $u, v$, the oracle can further support
2) WhichCluster($v$): Which cluster does $v$ belong to?
3) SameCluster($u, v$): Do $u$ and $v$ belong to the same cluster?
In the following, without loss of generality, we will assume that for any non-outlier vertex $v$ and the corresponding WhichCluster($v$) query, the oracle will output an integer $i$ with $1 \le i \le h$ that specifies the index of the cluster that $v$ belongs to, for some integer $h \le k$. Furthermore, given the ability to answer WhichCluster queries, for any two non-outlier vertices $u, v$, we simply define SameCluster($u, v$) to be the procedure that checks if WhichCluster($u$) is equal to WhichCluster($v$). This naturally ensures consistency for SameCluster queries. Note that the output of the algorithm naturally defines a partition of $V$, i.e.,
$$P_0 = \{v \in V : \text{IsOutlier}(v) = \text{Yes}\}, \qquad P_i = \{v \in V : \text{WhichCluster}(v) = i\} \ \text{ for } 1 \le i \le h.$$
We would like to minimize the number of vertices for which the oracle returns the “wrong” answers. That is, for most vertices $v$ that do belong to some underlying good cluster in the perturbed graph $G$, we expect IsOutlier($v$) to return “No”. Furthermore, for most vertices $u, v$ that belong to the same cluster (resp. different clusters), we expect SameCluster($u, v$) to return “Yes” (resp. “No”). One further crucial requirement of a robust clustering oracle and the corresponding clustering query algorithm is to maintain consistency among all queries. That is, on different query sequences, the answers of the oracle should be consistent with the same partition of $V$, in which all but a small fraction of vertices belong to some good cluster. Since the oracle construction and the corresponding query algorithm are typically randomized, we fix the random seed of the oracle and query algorithm once and for all to ensure consistent answers. Then the algorithm is a deterministic procedure for any input query, which further guarantees that the partition is determined by $G$ and the internal randomness of the oracle and the algorithm, and is independent of the order of queries. This feature allows the oracle to be used in a distributed manner, as consistency is guaranteed.
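The consistency requirement can be illustrated by a small interface sketch. It assumes a hypothetical `classify(v, rng)` routine standing in for the actual randomized query procedure (which is not reproduced here); the class name, the seeding scheme and the toy classifier are all illustrative assumptions. The only point shown is that deriving each query's randomness from one fixed seed makes the answers independent of query order, and that SameCluster reduces to two WhichCluster calls.

```python
import random


class ClusteringOracleSketch:
    """Interface sketch: answers depend only on the fixed seed and the queried vertex."""

    def __init__(self, classify, seed=0):
        # classify is a HYPOTHETICAL stand-in: (vertex, rng) -> cluster index or None.
        self._classify = classify
        self._seed = seed

    def which_cluster(self, v):
        # Re-derive the randomness from the fixed seed and the vertex, so that
        # repeated or reordered queries always see the same random bits.
        rng = random.Random(f"{self._seed}:{v}")
        return self._classify(v, rng)

    def is_outlier(self, v):
        return self.which_cluster(v) is None

    def same_cluster(self, u, v):
        cu, cv = self.which_cluster(u), self.which_cluster(v)
        if cu is None or cv is None:
            raise ValueError("same_cluster is only defined for non-outlier vertices")
        return cu == cv


def toy_classify(v, rng):
    """Toy randomized classifier: reports roughly 10% of vertices as outliers."""
    if rng.random() < 0.1:
        return None
    return v % 3
```

Because `same_cluster` compares two cluster indices, transitivity of the answers holds automatically for non-outlier vertices.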
We provide the first robust clustering oracle with both sublinear preprocessing time and query time. For simplicity, we will assume both $d, k$ are constant throughout the paper. Let $A \triangle B$ denote the symmetric difference between two vertex sets $A, B$.
Theorem 1.3 (Robust Clustering Oracle).
There exists an algorithm that takes as input parameters , , , , and has query access to the adjacency list of a graph $G$ that is an $\varepsilon$-perturbation of a $(k, \phi_{\mathrm{in}}, \phi_{\mathrm{out}})$-clusterable graph, and constructs a robust clustering oracle in pre-processing time. Furthermore, it holds that
1. Using the oracle, the algorithm can answer any clustering query (i.e., IsOutlier, WhichCluster or SameCluster) in time.
2. There exists a partition of $V$, for some , such that
- the partition only depends on $G$ and the input parameters of the algorithm, and is independent of the order of queries;
- if , then and each is a -cluster, for any ; if , then ; and
- with probability at least , the partition output by the algorithm satisfies that and .
We remark that there is no algorithm that allows both $o(\sqrt{n})$ pre-processing time and $o(\sqrt{n})$ query time for IsOutlier queries, as otherwise, one could obtain a property testing algorithm for expansion with $o(\sqrt{n})$ queries, which would contradict a known lower bound [GR00] (see more discussions below on the relation to property testing). Furthermore, the second item of the theorem implies that the total number of vertices that are reported as outliers is at most and that the query answers are consistent with a partition of $V$ in which all but vertices belong to a -cluster. We also note that in the statement of the above theorem, the most interesting range of $\varepsilon$ is (footnote 3), as otherwise (i.e., ) the noise will be too much and our algorithm cannot guarantee to locally identify even one cluster. Removing the gap between the inner conductance and outer conductance seems to be hard, at least for methods that are based on random walk distances (as we used here). For example, in [CKK+18], it has been discussed that in general, it is impossible to use Euclidean distance between random walk distributions to test $k$-clusterability if one wants the gap to be a constant. (Testing $k$-clusterability is an easier problem than the robust clustering oracle problem; see below.) On the other hand, being able to correctly answer SameCluster($u, v$) queries intuitively requires or induces a distance-based approach, as the vertices in the same cluster are “similar” or “close to” each other, while vertices in different clusters are “dissimilar” or “far from” each other.
Footnote 3: Note that in this range, , which is also the reason that we do see the traditional dependency (from Cheeger’s inequality) between the outer conductance and inner conductance.
Local reconstructor of graph cluster structure.
We are interested in designing a local reconstruction algorithm for the cluster structure of graphs. Given query access to the adjacency list of a $d$-bounded graph $G$ that is promised to be an $\varepsilon$-perturbation of a $(k, \phi_{\mathrm{in}}, \phi_{\mathrm{out}})$-clusterable graph, our goal is to design a local filter that provides query access to a -clusterable graph $G'$ such that the distance between $G$ and $G'$ is as small as possible. That is, we would like to output $G'$ in a local manner such that for any vertex query $v$, the neighborhood of $v$, i.e., the set of all neighbors of $v$, in $G'$ can be answered in sublinear time (in particular, by making as few queries to the adjacency list of $G$ as possible). As for the robust clustering oracle, it is crucial to require a local filter to maintain consistency among all queries. Here we require that for different query sequences, the answers of the filter should be consistent with the same reconstructed graph $G'$. Again, the filter is suitable to be used in a distributed manner, as consistency is guaranteed. In our local filter for clusterable graphs, we also aim to make the gap between and the gap between and as small as possible. We next state our theorem regarding our local filter for clusterable graphs.
Theorem 1.4 (Local Reconstructor of Cluster Structure).
There exists a local reconstruction algorithm that takes as input parameters , , , , and has query access to the adjacency list of a graph $G$ that is an $\varepsilon$-perturbation of a $(k, \phi_{\mathrm{in}}, \phi_{\mathrm{out}})$-clusterable graph, and provides query access to a graph $G'$ such that the following holds with probability at least :
1. $G'$ is -clusterable, and has maximum degree at most .
2. The number of edges changed is at most .
3. $G'$ is determined by $G$ and the internal randomness of the algorithm, and is independent of the order of queries.
4. On each query $v$, the neighborhood of $v$ in $G'$ can be answered in time.
Note that by Item 1, the resulting graph $G'$ can be partitioned into at most $k$ parts, each with relatively large inner conductance (i.e., ), with no guarantee on outer conductance (as each set trivially has outer conductance at most $1$). (Such instances are exactly the object that was studied in [CKK+18] in the framework of property testing.) By sacrificing the inner conductance quality, we can also find a clustering of $G'$ with small outer conductance. That is, we can guarantee that $G'$ is also -clusterable for any (see Appendix C for details). Item 3 implies that all query answers are consistent, that is, the vertex $u$ is output as a neighbor of $v$ in $G'$ if and only if $v$ is output as a neighbor of $u$. From the discussion below on the connections between our local reconstruction algorithm and property testing, the running time of our filter is optimal (in terms of dependency on $n$) up to polylogarithmic factors.
Furthermore, our algorithm generalizes the local reconstruction algorithm for expander graphs by [KPS13], which corresponds to the special case $k = 1$ in our problem, though our approximation ratio of the number of modified edges is worse. More precisely, for , both our algorithm and the algorithm in [KPS13] will add edges (as the noise part is too large, almost all vertices will be reported as outliers and the resulting graph is almost the complete hybrid of the original graph and an explicitly constructible expander; see Section 1.3 for more discussions); for , the algorithm in [KPS13] reconstructs a graph that is an -perturbation of a -expander by modifying at most edges, and the resulting graph has conductance at least and maximum degree also upper bounded by (footnote 4), while our algorithm has to modify edges. We further note that the algorithm in [KPS13] guarantees that the reconstructed graph has inner conductance at least , while the resulting graph from our algorithm is guaranteed to have a partition with at most parts, each with inner conductance at least . Removing the factor in the inner conductance of the output graph seems to be a very challenging task, even for the case . See Section 6 for more discussions.
Footnote 4: Note that [KPS13] claimed that the number of modified edges is at most and the maximum degree of the resulting graph is . However, this claim is not correct (at least for $d$-bounded graphs with $d$ being constant), and the number of changed edges and the maximum degree bound from their analysis should be and , respectively [Ses19]. They obtained their claimed results by adding parallel edges while repairing bad vertices, from which they get that the maximum degree is and the number of added edges to the optimal distance (i.e., ) is , which is incorrect as it always holds that for constant and large enough .
Local mixing property on noisy clusterable graphs.
In order to derive the above algorithmic results, we prove an interesting behavior, which we call the local mixing property, of random walks on noisy clusterable graphs. For technical reasons, we will consider the uniform averaging walk of $t$ steps on a graph $G$: In this walk, we choose a number $\ell$ uniformly at random, and stop the (normal) random walk after $\ell$ steps. We let $\mathbf{p}_v^t$ denote the probability vector for a uniform averaging walk of $t$ steps starting at $v$ and let $\|\mathbf{p} - \mathbf{q}\|_{TV}$ denote the total variation distance between two distributions $\mathbf{p}, \mathbf{q}$. We have the following theorem.
Theorem 1.5 (Local Mixing Property of Random Walks).
Let . Let for some sufficiently small constant . Let be a -bounded graph with an -partition such that for any . For each , we let denote a large subset of vertices such that , and let . If , then for any with , there exists a subset such that for any , and , it holds that
[TABLE]
Intuitively, the set corresponds to the noisy part inside each cluster and we assume that the total fraction of noisy part is parametrized by . Then the above theorem says that the rest of the large part (i.e., clusterable part) exhibits some nice local mixing property: a typical uniform averaging random walk (of appropriately chosen length) from a large cluster (of size ) will converge quickly to the uniform distribution on it. This is a generalization of the global mixing property of noisy expander graphs in [KPS13], though their results are stated for the more general Markov chains.
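The local mixing behaviour described above can be observed numerically on a toy graph. The sketch below computes the exact distribution of a uniform averaging walk under two assumptions that are not confirmed by the text: the stopping time $\ell$ is uniform over $\{0, \dots, t-1\}$, and a single step moves to each neighbor with probability $1/(2d)$ and stays put otherwise (a lazy walk commonly used for $d$-bounded graphs). The helper names and the toy graph are illustrative.

```python
def walk_step(adj, d, p):
    """One step of the ASSUMED lazy walk: each neighbor gets mass/(2d), the rest stays."""
    q = [0.0] * len(p)
    for u, mass in enumerate(p):
        if mass == 0.0:
            continue
        move = mass / (2 * d)
        for v in adj[u]:
            q[v] += move
        q[u] += mass - move * len(adj[u])
    return q


def averaging_walk(adj, d, start, t):
    """Exact distribution of a walk stopped after ell steps, ell uniform in {0, ..., t-1}."""
    n = len(adj)
    p = [0.0] * n
    p[start] = 1.0
    avg = [0.0] * n
    for _ in range(t):
        for i in range(n):
            avg[i] += p[i] / t
        p = walk_step(adj, d, p)
    return avg


def tv_distance(p, q):
    """Total variation distance between two distributions on the same vertex set."""
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))


# Toy instance: two copies of K4 joined by the single edge (3, 4); maximum degree d = 4.
two_cliques = [[1, 2, 3], [0, 2, 3], [0, 1, 3], [0, 1, 2, 4],
               [3, 5, 6, 7], [4, 6, 7], [4, 5, 7], [4, 5, 6]]
uniform_left = [0.25] * 4 + [0.0] * 4
uniform_right = [0.0] * 4 + [0.25] * 4
dist_from_0 = averaging_walk(two_cliques, 4, start=0, t=20)
```

Comparing `tv_distance(dist_from_0, uniform_left)` with `tv_distance(dist_from_0, uniform_right)` shows the walk from vertex 0 staying much closer to the uniform distribution on its own clique, which is the qualitative statement of the theorem on this toy example.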
1.3 Our Techniques
To design a robust clustering oracle, we first note that it is relatively easy to design a clustering oracle without noise (if the gap between and is as we considered here). This can be done by a refined analysis of the property testing algorithm in [CPS15] that samples a small number of vertices, and then tests whether the norm distance between the random walk distributions from any two vertices is larger than some threshold. However, the analysis depends on the spectral property (e.g., a gap between and ) of clusterable graphs, and cannot be easily generalized to the case that the input graph contains noise, as such a spectral property is very sensitive to noise (e.g., deleting all edges incident to a constant number of vertices will break the property).
In order to handle noisy input, we use the norm distance between the corresponding random walk distributions to test whether the two starting vertices belong to the same cluster, and we make use of the local mixing property of random walks in Theorem 1.5. In order to prove such a mixing property, we first show that it does hold for clusterable graphs without noise, by exploiting a spectral property that characterizes the first $k$ eigenvectors of clusterable graphs given by [PSZ17]. To generalize the result to a noisy clusterable graph $G$, we view the random walks on the graph as a Markov chain and consider a new Markov chain that is induced on vertices in the clusterable part in $G$. (Such a new chain has also been used in [KPS13] for analyzing noisy expanders.) We show that the induced Markov chain does correspond to a clusterable graph (by overcoming the difficulty that the outer conductance of each corresponding cluster increases and might change the cluster structure too much) and thus the random walks in satisfy the local mixing property. However, the walks on can be very different from the random walks in the original graph $G$. We then give a novel application of an old technique called stopping rules of Markov chains that was introduced by Lovász and Winkler [LW97] to relate these two walks, and bound the total variation distance between two random walk distributions from a vertex in any large cluster of and . This allows us to show the local mixing property in the graph $G$. To the best of our knowledge, we are the first to use the tool of stopping rules to show that a random walk in the graph mixes inside a subgraph (i.e., cluster) rather than in the whole graph.
Given such a local mixing property of random walks in the noisy clusterable graph, we are able to design a robust clustering oracle and the corresponding clustering query algorithm with sublinear preprocessing and query time. We first note that if the noisy part is not too large (i.e., ), then the graph $G$ has a non-trivial partition with that only depends on the corresponding parameters (i.e., ) and $G$ itself, and that each is a good cluster with large size (containing at least fraction of vertices), and has small size. Our key idea is to use random walks to learn a succinct representation , which is a weighted graph with roughly vertices, of the clusterable part of the graph $G$, such that each cluster in will be mapped to a unique clique (called a core) in with appropriate edge weight. Furthermore, by using the weights and the size bounds of these cliques, we can efficiently identify them from , using which we are able to answer the WhichCluster queries. Slightly more precisely, in the preprocessing (or learning) phase, the algorithm samples a set of vertices, and uses the statistics of random walks from each sampled vertex to (quite accurately) estimate the so-called reduced collision probability (rcp), introduced in [KPS13], of (the random walks of appropriate length from) any two sampled vertices. We construct a weighted similarity graph on the sample set such that the weight of each edge is our estimate of the rcp of , for any . We show that if the noisy part is not too large, then, by the aforementioned local mixing property, for (most) pairs of vertices , the rcp of will be close to . Thus, the weight of edge in will be set to be a number close to , and most vertices in form a clique in with edge weights close to .
We further observe that has relatively large size (roughly ), as is large; and that any vertex can belong to only one such (large) clique, as otherwise, the total probability mass of the random walk distribution from would exceed , which cannot happen. These properties allow us to efficiently identify the unique core from that corresponds to the cluster by a simple greedy algorithm and further to answer membership queries. We remark that in [CPS15], a similarity graph is also constructed, but that graph is unweighted and only tells whether the original graph is -clusterable according to the number of connected components, which is far from sufficient for our application.
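The greedy identification of cores from a weighted similarity graph can be sketched as follows. This is only an illustration under stated assumptions: the plain collision probability $\sum_w p_u(w)\,p_v(w)$ is used as a stand-in for the reduced collision probability of [KPS13] (whose definition is not reproduced here), and `threshold` and `min_size` are hypothetical parameters, not the values used in the analysis.

```python
def collision_probability(p, q):
    """Plain collision probability of two walk distributions (a stand-in for the rcp)."""
    return sum(a * b for a, b in zip(p, q))


def greedy_cores(sample, weight, threshold, min_size):
    """Greedily peel off groups of sampled vertices that are all similar to a seed vertex.

    sample    -- list of sampled vertices
    weight    -- function (u, v) -> estimated similarity of u and v
    threshold -- HYPOTHETICAL cut-off above which two vertices count as similar
    min_size  -- HYPOTHETICAL minimum number of sampled vertices in a core
    """
    remaining = list(sample)
    cores = []
    while remaining:
        seed = remaining[0]
        group = [u for u in remaining if u == seed or weight(seed, u) >= threshold]
        if len(group) >= min_size:
            cores.append(group)
            taken = set(group)
            remaining = [u for u in remaining if u not in taken]
        else:
            # The seed does not sit in a large enough clique; leave it unassigned.
            remaining = remaining[1:]
    return cores


def toy_weight(u, v):
    """Toy similarity: vertices with the same tens digit are treated as one cluster."""
    return 1.0 if u // 10 == v // 10 else 0.0


toy_cores = greedy_cores([1, 2, 3, 11, 12, 13, 25], toy_weight, threshold=0.5, min_size=2)
```

On the toy input the vertex 25 has no similar partner in the sample, so it is left outside every core, mirroring how a queried vertex that matches no core would be reported as an outlier.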
Then in the query phase, we check whether the queried vertex belongs to any of the learned cores to decide whether it is an outlier. This, again, can be done by estimating the rcp of the walks from and other vertices in (by running random walks), and is guaranteed by the local mixing property of random walks. In particular, for most vertices in a cluster , the rcp of random walks from and any other vertex that is in corresponding to will also be around . If this is the case, we output as the index of the cluster that belongs to; otherwise, we report it as an outlier. The above analysis shows that most vertices in will be correctly classified, or equivalently, the number of vertices that are reported as outliers is small.
Our local reconstruction algorithm for clusterable graphs is built upon our robust clustering oracle. That is, we first learn the cores of the input graph as before. Then (if the noisy part is not too large) we only “repair” all the vertices that are reported as outliers. Let be any vertex that is reported as an outlier. We add all the neighbors of in an explicit expander to “repair” the graph , which is called a hybridization (between and ) and has been used to repair expander graphs in [KPS13]. Then the answers are guaranteed to be consistent with a graph such that its distance to the original graph is at most times the number of vertices that are reported as outliers, which has already been bounded to be small. In order to prove the claimed guarantee on the cluster structure of , we introduce a definition of weak vertices that intuitively correspond to the noisy part of the graph. Such a definition has also been used in [KPS13], though ours is more subtle, depending on the size of noise. We can show that one can improve the cluster structure of the graph if we have repaired all the weak vertices in the above way. Furthermore, such weak vertices will always be reported as outliers, which is guaranteed by the performance of our robust clustering oracle.
1.4 Relation to Testing Graph Clusterability
Both the above robust clustering oracle and local reconstruction are closely related to the framework of property testing [RS96, GGR98]. In bounded degree graph property testing [GR02], given a property , the algorithm aims to distinguish graphs that satisfy from graphs that are -far from satisfying by making as few queries (to the adjacency list of the graph) as possible, with high constant probability, say at least . Here, a graph is said to be -far from satisfying property if one has to modify more than edges to make it satisfy , while preserving the degree bound. After two decades of study, a number of properties of bounded degree graphs are now known to be testable in constant time [GR02, BSS10, HKNO09, NS13], or time [GR98, GR00, CS10, KS11, NS10, CPS15, CKK+18, KSS18].
In particular, for the property of being -clusterable, [CPS15] gave a testing algorithm that runs in time and distinguishes -clusterable graphs from graphs that are -far from being -clusterable, for any . (Note that the algorithm rejects any graph that is far from clusterable graphs with arbitrary outer conductance.) [CKK+18] recently improved this algorithm by giving an algorithm for testing if a graph contains at most subsets with inner conductance at least from those that can be decomposed into at least subsets with size at least and outer conductance at most in time for any that is smaller than some constant (they also generalize their algorithm for general graphs). For the case of , i.e., testing if the graph has expansion at least , the best known algorithm can test if a graph has expansion or is -far from having expansion in time for any ([KS11, NS10] which improves upon [CS10]). Furthermore, there exists a lower bound of on the query complexity for testing expansion [GR02].
Note that both the robust clustering oracle problem and the reconstruction problem are always much harder than the property testing version (see e.g., [KPS13]). For example, in the oracle problem, we need to figure out the cluster structure of the clusterable graph, and in the local reconstruction problem, the algorithm actively repairs the input graph, while property testing is a decision problem. Furthermore, property testing only needs to distinguish between graphs which are clusterable and those that are -far from being clusterable, while both the clustering oracle and the reconstruction have to (in some sense) approximate the distance to the class of all clusterable graphs (footnote 5: actually, in our setting, we are approximating the intra-perturbation distance to the class of all clusterable graphs, i.e., the minimum number of intra-cluster edges needed to be modified to obtain a clusterable graph over all possible -partitions, for some . This is in contrast to approximating the distance to all clusterable graphs, which is the minimum number of edges needed to be modified to obtain a clusterable graph). Thus, the property testing algorithms cannot be directly used, or easily modified, to give a robust clustering oracle or local reconstruction algorithm. In particular, even for the case that the input graph is clusterable, one cannot use the corresponding property testing algorithm (on the clusterable graph) to answer SameCluster queries. Actually, both algorithms in [CPS15, CKK+18] make decisions based on some small summarizations of the input graph which are constructed from a small sample of vertices and the corresponding random walk statistics. Such small summarizations can be used to distinguish if the graph is -clusterable or is far from being -clusterable. However, if the graph is indeed -clusterable, they cannot be used to distinguish if two vertices are from the same cluster or are from two different clusters. 
As we mentioned before, in [CKK+18], evidence has been provided that in general it is not possible to use pairwise Euclidean distances between two random walk distributions to distinguish between -clusterable graphs and far from -clusterable graphs if the gap between conductances is constant.
On the other hand, property testing algorithms can always be obtained from the corresponding local reconstruction ones (which has already been noted in previous work on local reconstruction) and testing -clusterability can also be obtained from our robust clustering oracle algorithm. This is also true in our scenario since we can estimate the distance between and a clusterable graph with small additive error by sampling a constant number of vertices and running the oracle and clustering query algorithm (or the local reconstruction algorithm) on each sampled vertex to obtain the fraction of outlier vertices. We further note that if a graph is -far from any -clusterable graph, then it cannot be an -perturbation of any such clusterable graph (i.e., one has to perturb more than an -fraction of edges). Therefore, both our robust clustering oracle and local reconstructor algorithm lead to a property testing algorithm that distinguishes -clusterable graphs from graphs that are -far from being -clusterable for any , with probability at least . The running time of the algorithm is , which is optimal up to polylogarithmic factors due to the lower bound on the number of queries for testing expansion (corresponding to in our problem) [GR02].
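The reduction described above (sample a constant number of vertices, query the oracle on each, and look at the fraction reported as outliers) can be sketched as follows. This is only an illustrative sketch under stated assumptions: `which_cluster` is a hypothetical callable standing in for the WhichCluster query that returns `None` for an outlier, and `num_samples` and `threshold` are placeholders for quantities that the paper derives from its parameters and that are not recoverable from this copy.

```python
import random


def estimate_outlier_fraction(n, which_cluster, num_samples, rng):
    """Estimate the fraction of vertices that the oracle reports as outliers.

    n             -- number of vertices, labelled 0..n-1 (assumption)
    which_cluster -- hypothetical oracle: returns a cluster index, or None
                     when the queried vertex is reported as an outlier
    """
    outliers = 0
    for _ in range(num_samples):
        v = rng.randrange(n)              # uniformly random vertex
        if which_cluster(v) is None:
            outliers += 1
    return outliers / num_samples


def accepts_as_clusterable(n, which_cluster, threshold, num_samples=200, seed=0):
    """Accept iff the estimated outlier fraction does not exceed the threshold.

    Both threshold and num_samples are illustrative placeholders.
    """
    rng = random.Random(seed)
    fraction = estimate_outlier_fraction(n, which_cluster, num_samples, rng)
    return fraction <= threshold
```

The additive error of the estimate shrinks with the number of samples by a standard concentration argument, which is why a constant number of oracle queries suffices for a constant error parameter.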
1.5 Other Related Work
The study on local graph clustering [ST13, ACL06, AP09, OT12, AOPT16, ZLM13, OZ14] is also closely related to our work. In this framework, the goal is to find a cluster from a specified vertex with running time that is bounded in terms of the size of the output set (and with a weak dependence on ). In the scenario where both inner and outer conductance are used for measuring the quality of clusters, [ZLM13] gave a local clustering algorithm that outputs a set with conductance at most where is the target set, and is the reciprocal (e.g., ) of the mixing time of the random walk over the induced subgraph on and is the total degree of vertices in . It is also shown that the conductance guarantee is tight among (some class of) random-walk based local algorithms [ZLM13]. It might be interesting to note the logarithmic factor (i.e., ) dependency that appears in these guarantees. The performance guarantee has later been improved by [OZ14] using a flow-based local improvement algorithm that finds a set with conductance , volume and runs in time , where is the target set with . Note that the running times of these algorithms are sublinear only if the size (or volume) of the target set is small (say, at most ), while in our setting, the clusters of interest have at least linear size (for any constant ).
Fully or partially recovering the clusters in the noisy model has been extensively studied in the “global algorithm regimes”. Examples include recovering the planted partition in stochastic block model with modeling errors or noise (e.g., [CL15, GV16, MPW16, MMV16]), correlation clustering on different ground-truth graphs in the semi-random model (e.g., [MS10, CJSX14, GRSY14, MMV15]) and partitioning the graph in the average-case model [MMV12, MMV14, MMV15]. All these algorithms run in at least linear time.
Local reconstruction of some other properties has been investigated before. Such properties include expanders [KPS13], graph connectivity and diameter [CGR13], bipartite and -clique dense graphs [Bra08], geometric properties [CS11], monotone functions [ACCL08, SS10], Lipschitz functions [JR13] and low rank matrices and subspaces [DGK17]. This algorithmic framework is also closely related to locally decodable codes (e.g., [STV99]) and local decompression [DLRR13]. The local reconstruction model has been generalized to the local computation model by Rubinfeld et al. [RTVX11, ARVX12], and a number of problems like maximal independent set, hypergraph coloring and maximum matching have been investigated in this model [RTVX11, ARVX12, MRVX12, MV13].
Organization of the paper.
We give preliminaries in Section 2. In Section 3, we give the algorithm and the analysis for our robust clustering oracle and prove Theorem 1.3. Then, we give our local reconstruction algorithm, its analysis and prove Theorem 1.4 in Section 4. Both proofs for Theorem 1.3 and 1.4 will rely on the local mixing property of random walks in noisy clusterable graphs, i.e., Theorem 1.5, which we prove in Section 5. We conclude in Section 6.
2 Preliminaries
Let denote an -vertex undirected graph with maximum degree bounded by some constant , where . For each vertex , we let denote its degree. Throughout the paper, all the vectors will be row vectors unless otherwise specified or transposed to column vectors. For a vector , we let and denote its norm and norm, respectively. Let denote the indicator vector of set , that is if and 0 otherwise. Let . Let denote the uniform distribution on set . For any set of vectors , we let denote the linear span of , that is . For a vector and a set , we let . For two distributions and , we let denote the total variation distance between . It is known that .
Different types of random walks on .
We will consider the following random walks.
(1) (Normal) random walk of steps. In a (normal) random walk, at each step, suppose we are at vertex , then we jump to a random neighbor with probability and stay at with the remaining probability . We stop the walk after steps. We let denote the probability vector for a step random walk starting at .
(2) Uniform averaging walk of steps. In this walk, we choose a number uniformly at random, and stop the (normal) random walk after steps. We let denote the probability vector for a uniform averaging walk of steps starting at .
(3) Uniform averaging walk of steps with two phases. In this walk, we choose two integers uniformly at random, and stop the walk after steps. We let denote the probability vector for a uniform averaging walk of steps with two phases starting at .
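The three walks above can be simulated as in the following minimal sketch. Several details are assumptions, because the corresponding formulas are missing from this copy: the step probabilities (each neighbor with probability 1/(2d), staying put otherwise) are inferred from the later remark that the normal walk coincides with the lazy walk on the d-regular graph, and the stopping times are assumed to be uniform over {0, ..., t-1}. The names `adj`, `d`, and `t` are illustrative.

```python
import random


def lazy_walk(adj, d, start, steps, rng):
    """Normal walk: move to each neighbor w.p. 1/(2d), otherwise stay.

    adj maps a vertex to its list of neighbors (each of length at most d).
    """
    v = start
    for _ in range(steps):
        r = rng.randrange(2 * d)          # uniform over 2d slots
        if r < len(adj[v]):               # first deg(v) slots: move
            v = adj[v][r]
        # remaining 2d - deg(v) slots: stay at v
    return v


def uniform_averaging_walk(adj, d, start, t, rng):
    """Stop the normal walk after a uniformly random number of steps."""
    ell = rng.randrange(t)                # assumption: uniform over {0,...,t-1}
    return lazy_walk(adj, d, start, ell, rng)


def two_phase_walk(adj, d, start, t, rng):
    """Two independent uniform stopping times, walked back to back."""
    l1, l2 = rng.randrange(t), rng.randrange(t)
    return lazy_walk(adj, d, start, l1 + l2, rng)


def empirical_distribution(walk, adj, d, start, t, trials, seed=0):
    """Monte Carlo estimate of the endpoint distribution of a walk."""
    rng = random.Random(seed)
    counts = {}
    for _ in range(trials):
        end = walk(adj, d, start, t, rng)
        counts[end] = counts.get(end, 0) + 1
    return {v: c / trials for v, c in counts.items()}
```

For example, `empirical_distribution(uniform_averaging_walk, adj, d, v, t, trials)` approximates the probability vector of the uniform averaging walk from `v`; the estimate is only as accurate as the number of trials allows.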
It is useful to note that for any two vertices ,
A simple reduction: from -bounded graphs to -regular graphs.
Given a graph with maximum degree upper bounded by , it will be very convenient to consider the -regular graph that is obtained by adding an appropriate number of self-loops (each with half weight) to each vertex so that every vertex has degree exactly . Note that the (normal) random walk on we defined above is exactly the lazy random walk of the graph . Let denote the adjacency matrix of , and let denote the normalized Laplacian matrix of . We let denote the eigenvalues of and let denote the corresponding orthonormal (row) eigenvectors. That is, . Note that the lazy random walk matrix corresponding to is . This implies that the eigenvalues of are , with corresponding eigenvectors . In particular, . Furthermore, it holds that .
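The spectral relation used in this reduction can be spelled out as follows. The notation is an assumption, since the symbols are missing from this copy: $A$ is the (weighted) adjacency matrix of the regularized graph with every row summing to $d$, $L$ is its normalized Laplacian with eigenvalues $\lambda_i$ and row eigenvectors $v_i$, and $W$ is the lazy random walk matrix.

```latex
\begin{align*}
L &= I - \tfrac{1}{d} A, \qquad
W = \tfrac{1}{2}\Bigl(I + \tfrac{1}{d} A\Bigr) = I - \tfrac{1}{2} L, \\
v_i L = \lambda_i v_i
  &\;\Longrightarrow\;
  v_i W = \Bigl(1 - \tfrac{\lambda_i}{2}\Bigr) v_i
  \;\Longrightarrow\;
  v_i W^{t} = \Bigl(1 - \tfrac{\lambda_i}{2}\Bigr)^{t} v_i .
\end{align*}
```

Since the eigenvalues of a normalized Laplacian lie in $[0, 2]$, the eigenvalues of $W$ lie in $[0, 1]$, which is the usual reason for working with the lazy walk.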
Estimating reduced collision probabilities.
Both our robust clustering oracle and local reconstruction need to invoke a procedure to estimate the reduced collision probability of two random walks [KPS13]. For a vertex , an integer and a constant , we let . For any two vertices , the -reduced collision probability of is defined as
[TABLE]
Observe that by definition of -random walks, it holds that
[TABLE]
The following lemma shows that under appropriate conditions, the reduced collision probability of two vertices can be well approximated in time.
Lemma 2.1 ([KPS13]).
Let be two constants. Let be two vertices. There exists a procedure EstimateRCP() that takes as input a -bounded -vertex graph , vertices , parameters , and length parameter , and satisfies the following properties:
- 1) It runs in time ;
- 2) If , then it aborts (without outputting an estimate) with probability at most ;
- 3) If it does not abort, then with probability at least , it outputs an estimate such that
[TABLE]
For the sake of completeness, we give the description of the algorithm EstimateRCP in Appendix B.
3 Robust Clustering Oracle
In this section, we present our algorithm for constructing the robust clustering oracle and answering the clustering queries. In the preprocessing (or learning) phase, the algorithm learns the cores (corresponding to clusters in the clusterable part) of the graph. In the query phase, the algorithm checks if the queried vertex belongs to any of the learned cores or not to decide if it is an outlier or not. If not, the algorithm will find the index corresponding to the cluster that belongs to.
We will use the reduced collision probability of random walks of length for some sufficiently small constant . Such probabilities can be efficiently estimated by invoking the EstimateRCP procedure (see Section 2). The intuition is that for a typical vertex in a large cluster , the uniform averaging walk of steps from will be close to the uniform distribution on (by Theorem 1.5), which implies that for almost all vertices , their reduced collision probability is at least .
The learning phase of the algorithm is as follows.
[TABLE]
The subroutine FindCore() is defined as follows.
[TABLE]
Note that by the above definition of cores, it holds that for any core , there exists such that and the edge weight in the clique is at least .
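To illustrate how cores could be extracted from a weighted similarity graph, here is a minimal greedy sketch. It is not the paper's FindCore subroutine, whose pseudocode is not available in this copy; it only shows one simple way to collect large cliques whose edge weights all exceed a threshold. The names `find_cores`, `weight`, `threshold`, and `min_size` are hypothetical, and the actual threshold and size bound are set by the paper's parameters.

```python
def find_cores(vertices, weight, threshold, min_size):
    """Greedily collect disjoint cliques with all edge weights >= threshold.

    vertices  -- the sampled vertices
    weight    -- hypothetical callable: weight(u, v) is the estimated rcp of
                 the pair (symmetric)
    threshold -- minimum edge weight inside a core (placeholder value)
    min_size  -- minimum number of vertices in a core (placeholder value)
    """
    unassigned = list(vertices)
    cores = []
    while unassigned:
        seed = unassigned[0]
        clique = [seed]
        # Add a vertex only if it is heavy-adjacent to every current member.
        for u in unassigned[1:]:
            if all(weight(u, w) >= threshold for w in clique):
                clique.append(u)
        if len(clique) >= min_size:
            cores.append(clique)
            members = set(clique)
            unassigned = [u for u in unassigned if u not in members]
        else:
            # The seed is not in any large clique found greedily; drop it.
            unassigned = unassigned[1:]
    return cores
```

A greedy pass like this does not find maximum cliques in general graphs. It is adequate here only to the extent that the similarity graph has the structure described in the text, namely that each sampled vertex belongs to at most one large heavy clique.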
We need the following subroutine to answer clustering queries.
[TABLE]
Now we are ready to describe our algorithm for answering clustering queries.
[TABLE]
3.1 The Analysis of Robust Clustering Oracle
In the following, we show the performance guarantee of the above algorithm. We will use the local mixing property on noisy clusterable graphs as guaranteed in Theorem 1.5, whose proof is deferred to Section 5. Recall from the description of our algorithm that , which is a sufficiently small universal constant.
If (i.e., the noise is too much), then by our algorithm, the learning phase will output fail. Any queried vertex will be reported as Outlier.
In the following, we assume that and we prove the statement of Theorem 1.3. To do so, we first introduce the definition of strong vertices, which correspond to vertices in the clusterable part.
Definition and properties of strong vertices.
Let . Let be an -perturbation of a -clusterable graph. Recall that and denote the distribution of the uniform average walk of length and the uniform average walk of length with two phases starting from , respectively. In the algorithm, we invoke EstimateRCP with length parameter .
We let . We introduce the following definition of strong vertex for the analysis, which was inspired by the corresponding definition for noisy expander graphs in [KPS13]. The main difference here is that we carefully take the size of clusters into consideration.
Definition 3.1.
We call a vertex a strong vertex with respect to a subset if , and
Recall that is a sufficiently small constant, and that is the reduced collision probability of (see Section 2). We have the following properties of strong/weak vertices, which easily follow from the proof of Lemma 2 in [KPS13]. We present the proof in Appendix A for the sake of completeness.
Lemma 3.2.
If a vertex is strong with respect to a set with , then (1) there can be at most vertices in with ; (2) it holds that .
Furthermore, if vertices are both strong with respect to a set with , then we have that .
The correctness of the robust clustering oracle.
Now we show the correctness of the robust clustering oracle and bound the total number of vertices reported as outliers by the algorithm. Recall that we let with for some integer , and denote the partition output by our algorithm.
Lemma 3.3.
Let be an -perturbation of a -clusterable graph. Then there exists a partition for some (that is independent of the order of queries), such that
- if , then and each is a -cluster, for any ; if , then ; and
- with probability at least , the partition output by the algorithm satisfies that and .
In particular, the number of vertices reported as outliers is at most .
Proof.
We first note that if , then we can simply take (and thus ) and then for any output partition of the algorithm, it holds that .
Thus, in the following, we assume that .
Let . Let be a -clusterable graph such that is an -perturbation of . Let be the corresponding -clustering of for some . That is, for each , , and one can insert/delete at most edges inside subgraphs to make all become -clusters.
Now for each set , we perform the following process on recursively. We start with and . If , and there exists a subset with and , then we update , and . We recurse until no such set can be found or . Note that by our construction, the final set satisfies that and that has inner conductance at least . Furthermore, it holds that , since right before the last update, we have that and that the final cut satisfies that , which gives that .
Now we claim that . Assume on the contrary that , i.e., . First, we note that in order to make , then we should add at least edges, where the inequality follows from the fact that which in turn is due to the fact that . Therefore, in order to make all have inner conductance at least , we have to add at least edges, which is a contradiction.
We note that since , then it holds that at least one has size at least . Now we apply Theorem 1.5 on with error parameter , , sets , such that , to obtain that for each with , there exists a subset such that and for any , and ,
[TABLE]
This further implies that all vertices in are strong with respect to , as . We also note that for each with , it holds that . Now we order such that (breaking ties arbitrarily). Let be the largest index with . Note that . We define the partition . By definition, it holds that for each , and . Note that the partition only depends on . It holds that .
We further define .
Now we show the following claim.
Claim 3.4.
With probability at least , for all vertices in , WhichCluster() will output a unique index if vertex for some injection .
Note that the statement of the lemma will then follow from the above claim: Let be the largest index output by the algorithm, and let , for be the partition output by the algorithm. Then by Claim 3.4, all vertices in will be correctly partitioned and
[TABLE]
Re-arranging the order of sets will complete the proof of the Lemma. Now we prove the claim.
Proof of Claim 3.4.
By the previous analysis, we have that for each such that , the number of vertices in that are not strong (with respect to ) is at most
[TABLE]
That is, for each such that , at least fraction of vertices in are strong (with respect to ).
Now let us consider the sample set . Recall that for some large constant . Let and let denote the set of vertices in that are strong with respect to . By the Chernoff bound, we have that with probability at least , for any such that ,
[TABLE]
In the following, we will condition on the event that the above two inequalities hold.
Now recall that , for , where is the maximum integer such that . Let denote the index such that . Thus, .
Let be two vertices in . By Lemma 3.2, we have that , and . By the assumption that and Lemma 2.1, we obtain that with probability at least , EstimateRCP will output a value that is at least . That is, with probability at least , in the similarity graph , the induced subgraph will form a complete graph with at least vertices such that for each pair , . Therefore, in our sample, the set will be recognized as a subgraph of a core (corresponding to ), which is a maximal clique with edge weight at least .
Now once a vertex is queried (for checking whether it is an outlier or not), then by a similar argument as above, we can guarantee that with probability at least , for all , the EstimateRCP will output satisfying that . Thus, the algorithm will detect the core (corresponding to ) for . Furthermore, for any vertex that is strong with respect to , it holds that for any with , there can be at most vertices with ; this is true since the total probability mass on of the random walk distribution from is at most . This ensures that there will be a unique core corresponding to . Let denote the corresponding bijection between and the cores found by the algorithm. By the union bound, we have that with probability at least , for each strong vertex , the algorithm will answer the corresponding index to the query WhichCluster().
Running time and query complexity.
Note that in the learning phase, we need to invoke the procedure EstimateRCP times, and each invocation takes time , which in total takes time . Finding the cores in the similarity graph can be implemented by a simple greedy algorithm, which runs in time. Thus, the query complexity and running time in the learning phase are dominated by , which, by similar arguments, also upper bounds the query complexity and running time on each query vertex in the query phase.
Remark.
From Lemma 3.3 and its proof, we note that in order to guarantee that , i.e., there exists at least one good cluster , we need to set (so that ). Thus our algorithm has non-trivial guarantee only if the adversary does not perturb the graph too much. Suppose that there are ground-truth clusters and the adversary perturbs an -fraction on intra-cluster edges. In order to recover for each , a subset that is close to , then we need to require that , which can be satisfied if .
We further remark that our algorithm is only able to (partially) recover the large clusters, say of size at least . This is the case as any small cluster (of size ) can be completely hidden or destroyed by the adversary. Currently, our analysis shows that our algorithm can recover the cluster of size . It would be an interesting question to design a robust clustering oracle that can recover smaller clusters (i.e., of size in the range ).
4 The Local Reconstruction Algorithm
In this section, we present our reconstruction algorithm, which will be built upon our robust clustering oracle algorithm in Section 3 and consists of two phases: the learning phase, which learns the cores (corresponding to clusters in the clusterable part) of the graph, and the query phase, which first checks if the queried vertex belongs to any of the learned cores or not, and then outputs its neighbors in the amended clusterable graph accordingly. We need the following tool of explicit construction of expanders.
Explicit expanders.
For any vertex set , we let denote a graph on with maximum degree at most such that for any set in with , it holds that , for some constant . It is known (see e.g., Lemma 6 in [KPS13] which builds upon [GG81]) that such an expander also exists and can be explicitly constructed in the sense that for any specified vertex , one can find all neighbors of in in time.
In the following, given a graph , we let denote an explicit expander graph on the same vertex set as . We call vertices or -neighbors of a vertex , depending on the graph under consideration.
[TABLE]
Note that the algorithm should be implemented by first taking as input a random seed , which is fixed once for all (and used for sampling vertices in the learning phase and performing random walks), and then on any query vertex , deterministically outputting the neighborhood of in the graph . By construction, if an edge is added, then on query vertex , will be output as a neighbor of and vice versa. Therefore, the algorithm is independent of the order of queries and the answer will be globally consistent.
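The consistency requirement above can be made concrete with the following sketch of a single query. It assumes three hypothetical callables: `g_neighbors` and `h_neighbors` return the neighbors of a vertex in the input graph and in the explicit expander, and `is_outlier` is the outlier test of the oracle, evaluated deterministically once the random seed is fixed. The rule of also testing each expander neighbor is one way to realize the "and vice versa" property stated in the text; it is an illustration, not the paper's pseudocode.

```python
def reconstructed_neighbors(v, g_neighbors, h_neighbors, is_outlier):
    """Neighbors of v in the repaired graph (hybridization sketch).

    An expander edge (v, u) is included iff at least one endpoint is
    reported as an outlier, so the answer for v agrees with the answer
    that a later query on u would give.
    """
    result = set(g_neighbors(v))          # all original edges are kept
    v_is_outlier = is_outlier(v)
    for u in h_neighbors(v):
        if v_is_outlier or is_outlier(u):
            result.add(u)
    return sorted(result)
```

Because the expander has bounded degree, a query makes only a bounded number of outlier tests on top of reading the original adjacency list of the queried vertex.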
4.1 Analysis of the Local Reconstruction Algorithm
In the following, we show the performance guarantee of the above algorithm and prove Theorem 1.4. We first note that the running time and query complexity can be analyzed in the same way as in the proof of Theorem 1.3.
It follows from the definition of that the maximum degree of is bounded by , as has maximum degree at most and for each vertex that is found to be an outlier, we will add all of its -neighbors to .
Recall from the description of our algorithm that is a sufficiently small universal constant. If (i.e., the noise is too much), then by our algorithm, the learning phase will output fail. Furthermore, on querying any vertex , the query phase will output all of its and neighbors of . Thus, is a complete hybridization of and . Note that for any set , , where and denote the set of edges in and respectively. Thus, it holds that if , , where we used the fact that for any set with in , . Therefore, the resulting graph is -clusterable. Furthermore, the number of edges added to is at most as . Thus, in this case, the statement of our theorem holds.
In the following, we prove the remaining properties listed in Theorem 1.4 for the more interesting case that .
In this case, by the description of the local reconstruction algorithm, the number of added edges is times the number of vertices that are reported as outliers, and thus by Lemma 3.3, is at most . Now we analyze the cluster structure of the resulting graph.
Definition and property of weak vertices.
Let . We introduce the following definitions of weak vertex for the analysis, which was inspired by the corresponding definitions for noisy expander graphs in [KPS13]. The main difference here is that we carefully take the size of clusters into consideration.
Definition 4.1.
We call a vertex a weak vertex, if for any subset with , it holds that
In order to analyze the cluster structure of the resulting graph , we need the following property of weak vertices.
Lemma 4.2.
With probability at least , it holds that for any weak vertex , the algorithm will report as an outlier.
Proof.
We first show that if is weak, then for any subset with vertices, at most vertices in satisfy . This is true since otherwise, there will be more than vertices satisfying . If we let (resp. ) denote the set of vertices in such that (resp. ), then
[TABLE]
which is a contradiction. By the definitions of reduced collision probability and relations of and , we have that , and thus there can be at most vertices in with . Note that this property holds for all sets with .
For each , we let denote the set of vertices such that . Recall that , for .
If , then for all vertices , it holds that , which is a contradiction. If , then we can add arbitrarily at most vertices to to obtain a set such that , and for at least fraction of vertices in , it holds that , which is a contradiction. Therefore, it must hold that .
That is, for the weak vertex , it holds that for each , there will be at most vertices with . Thus, there will be at least vertices with . We can further guarantee that with probability at least , for any such pair , the procedure EstimateRCP (with parameter ) either aborts or outputs an estimate , for any . Finally, with probability at least , in our sample set , at least fraction of vertices satisfy that , or equivalently, less than fraction of vertices satisfy that . This implies that our algorithm will report as an outlier.
Cluster structure of .
Now we are ready to show that the resulting graph from our local reconstruction algorithm can be partitioned into at most parts, each of which has relatively large inner conductance.
Lemma 4.3.
Let for some sufficiently small constant . If is an -perturbation of a -clusterable graph, then the resulting graph from the local reconstruction algorithm is -clusterable.
Proof.
For analysis, we perform the following procedure on the input graph . Let . We start with the set and a partitioning of . Then if there exists a set and such that and , then we set . We repeat until no such can be found. Let denote the final partitioning of .
Note that for any , if is the subset that contains and is then split into and , then and thus and by the construction. This implies that at the end of the above procedure, it holds that .
We further note that . This is true since otherwise, in order to make become a -clusterable graph, one has to patch up at least one set to other parts, that is, we need to add at least edges, which is a contradiction to the assumption that is an -perturbation of a -clusterable graph.
Now let us consider the partition in the constructed graph . Observe that by the description of our algorithm, for any set of vertices , where and denote the set of edges in and respectively. In particular, Lemma 4.2 implies that the set of -neighbors of any weak vertex is a superset of the set of -neighbors of , as will be reported as an outlier by the algorithm and the -neighbors of will be added to .
We have the following claim.
Claim 4.4.
In the graph , for each , and any subset with , it holds that .
Proof.
If , then by our construction of , we have that . Thus, . Now let us consider the case that .
If less than fraction of vertices in are weak, then we show that . Suppose this is not the case, that is, , if we set to be a sufficiently small constant. By the proof of Theorem 4 in [KPS13] (which in turn is based on the proof of Lemma 4.7 in [CS10]), we know that for at least fraction of vertices in , the probability that a -random walk that starts at will end up in is at most . Now let be any set with . Since , it holds that . Thus, we have that . This gives that , which implies that such a vertex is weak. Thus, contains at least fraction of weak vertices, which is a contradiction. This implies that .
If there are more than fraction of weak vertices, denoted by , in , then the number of -neighbors of in is at least . Since all these -neighbors are also in , we have that the number of vertices outside of is at least . Since we add all the edges in that are incident to to , we have that the number of edges crossing in is at least , and thus .
Now based on the partition as constructed above, we find a new partition of such that each part has large inner conductance. We start with the partition as constructed above and perform the following operations. If there exist , satisfying that , and that , then we set and . We repeat until the condition is violated.
Note that the above process always terminates in a finite number of steps since the number of crossing edges, i.e., , always decreases in each iteration. Furthermore, we observe that at the end of the process, for any , and any set with , . Therefore, . This implies that for each , .
5 Local Mixing Property of Random Walks on Noisy Clusterable Graphs: Proof of Theorem 1.5
In this section, we give the proof of Theorem 1.5. To do so, we first give a property of random walks on clusterable graphs (without noise).
5.1 Local Mixing Property of Random Walks on Clusterable Graphs
We will first prove a mixing property of random walks on a clusterable graph, which says that in a clusterable graph, a random walk of appropriate length starting from a typical vertex of a large cluster will mix well inside the corresponding cluster. By a simple reduction (see Section 2), it suffices to consider a corresponding weighted -regular graph for any -bounded graph.
Theorem 5.1.
Let . Let for some sufficiently small constant . Let be a weighted -regular and -clusterable graph with underlying clusters for some . Then for each with , there exists a subset such that , and for any , and , it holds that
[TABLE]
We remark that [ST13] and [AOPT16] gave analyses that, respectively, upper bound the probability that a random walk of length from a typical vertex in a set with small conductance escapes the set , and lower bound the probability that the walk from of length stays inside . It is unclear whether one can use their analyses to prove the above theorem. In the following, we prove Theorem 5.1 by using strong spectral properties of clusterable graphs, namely the spectral gap between and for some , and the closeness of the space spanned by the first eigenvectors to the space spanned by the indicator vectors of the clusters. More precisely, we need the following tools.
Lemma 5.2 (Lemma 5.2 in [CPS15] and Lemma 10 in [CKK+18]).
Let be a weighted -regular and -clusterable graph with underlying clusters for some . Then and .
Fact 5.3.
It holds that , for any .
The following is a direct corollary of a structural result due to [PSZ17] that relates the first eigenvectors of the Laplacian to the normalized indicator vectors of some -partition of the graph. Recall that is the eigenvector corresponding to the -th smallest eigenvalue of the Laplacian of .
Theorem 5.4.
Let for sufficiently small constant . Let be a weighted -regular and -clusterable graph with underlying -clusters for some . Let . Then there exist orthonormal vectors and a constant , such that
[TABLE]
Proof.
Let , where the minimum is taken over all -partitions . It is proven in Theorem 1.1 of [PSZ17] that if for some constant , then there exist orthonormal vectors such that
Note that by definition, . In addition, by Lemma 5.2, it holds that . Furthermore, since , it holds that as is a sufficiently small constant. This then implies that for some constant .
Now we are ready to prove Theorem 5.1. We first give the high-level idea. We will bound the -norm distance between the random walk distribution and the uniform distribution over the cluster that contains , i.e., . In order to do so, we note that by Theorem 5.4, the vector , which is a scaling of the indicator vector of , lies in a space that can be well approximated by the space of the first (where is the number of clusters) eigenvectors of the matrix . Using this, we show that the projection of onto the space spanned by the first eigenvectors is small. Furthermore, by Lemma 5.2, is large, and thus the length of the projection of onto the space spanned by the remaining eigenvectors is dominated by , which is also small for appropriately chosen . Now we give the details.
Proof of Theorem 5.1.
For any vertex , we let . We first note that . Therefore, by the averaging argument, there can be at most vertices with .
Note that by the precondition of the Theorem, it holds that . Let and be the vectors as defined in Theorem 5.4. Let . Then by applying Theorem 5.4 with graph , we have that
[TABLE]
Again, by the averaging argument, there can be at most vertices with .
Now let us define . Note that for any with , it holds that .
Let us consider any vertex . Since , it holds that
[TABLE]
where the last equation follows from the fact that have the same linear span as vectors , which in turn follows from the properties of as guaranteed by Theorem 5.4.
Recall that . We let . Thus, we have that
[TABLE]
Now observe that
[TABLE]
Therefore,
[TABLE]
where the last inequality follows from our setting that and , where is some sufficiently small constant.
Therefore, it holds that .
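The qualitative behaviour behind Theorem 5.1 can be illustrated numerically. The following is a minimal sketch, not the construction used in the proof: it assumes the d-regularization is done by padding degrees with self-loops and that the walk is the lazy walk with transition matrix (I + A/d)/2; the toy graph (two cliques joined by one edge) and all function names are hypothetical choices made for illustration only.

```python
import numpy as np

def two_clique_graph(m):
    """Adjacency matrix of two m-cliques joined by a single edge (toy 2-cluster graph)."""
    n = 2 * m
    A = np.zeros((n, n))
    A[:m, :m] = 1.0
    A[m:, m:] = 1.0
    np.fill_diagonal(A, 0.0)
    A[m - 1, m] = A[m, m - 1] = 1.0  # the only inter-cluster edge
    return A

def lazy_walk_matrix(A, d):
    """Lazy walk on the d-regularized graph (assumption: self-loops pad each degree to d)."""
    n = A.shape[0]
    A_reg = A + np.diag(d - A.sum(axis=1))
    return 0.5 * (np.eye(n) + A_reg / d)

def distance_to_uniform_on_cluster(P, start, cluster, t):
    """l2 distance between the t-step distribution from `start` and uniform on `cluster`."""
    n = P.shape[0]
    p = np.zeros(n)
    p[start] = 1.0
    p = p @ np.linalg.matrix_power(P, t)
    u = np.zeros(n)
    u[list(cluster)] = 1.0 / len(cluster)
    return float(np.linalg.norm(p - u))

m = 8
A = two_clique_graph(m)
d = int(A.sum(axis=1).max())  # maximum degree of the toy graph
P = lazy_walk_matrix(A, d)
cluster = range(m)            # the clique containing the start vertex 0
dists = {t: distance_to_uniform_on_cluster(P, 0, cluster, t) for t in (0, 10, 10**5)}
```

In this toy instance a walk of moderate length (t = 10) is already close to uniform on the starting clique, while a very long walk (t = 10^5) spreads over the whole graph and ends at distance 1/4 from the cluster-uniform distribution. This is why the walk length has to be chosen appropriately.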
5.2 From Clusterable Graphs to Noisy Clusterable Graphs
Now we analyze the random walk on a noisy clusterable graph , for which we use an induced Markov chain introduced in [KPS13] and some property of stopping rules of Markov chains [LW97].
A tool: stopping rules of Markov Chains.
Consider a finite, irreducible, discrete time Markov chain on the state space with stationary distribution . For any distribution , we let denote the distribution of a -step walk on the Markov chain with initial distribution . A stopping rule of the Markov chain is a rule that observes the walk and decides whether to stop or not on the basis of what has been observed so far (see, e.g., [LW97] for a formal definition). Given a starting distribution and a target distribution , we say that a stopping rule is a stopping rule from to if the initial state is drawn from and the final state is governed by . Let denote the expected length before halts. For any two distributions and , we let denote the minimal expected length among all stopping rules from to .
Let denote the distribution of a uniform average walk of length with initial distribution . The following lemma was proved by Lovász and Winkler.
Lemma 5.5 ([LW97]).
For any distribution , and any subset ,
[TABLE]
where denotes the probability vector of an step random walk on the Markov chain with initial distribution .
We remark that the above inequality is not explicitly stated in [LW97], but the proof of Lemma 4.22 in [LW97] directly implies it.
An induced Markov chain.
Let be a -bounded graph. Let be the Markov chain corresponding to the (normal) random walks on the input graph . For simplicity, we assume is irreducible (i.e., the graph is connected). By definition, the stationary distribution of is the uniform distribution on , that is . Let denote a (large) subset of and let . Now we describe the new Markov chain , that has been considered in [KPS13], with state set as follows. For any two vertices , the transition probability in is the sum of , i.e., the transition probability from to in , and the probability that is equal to the total probability of all length walks from to all of whose states, except for the end points and are in , for any integer . That is, . The chain is formally constructed by first retaining the original transition in between and then adding new transitions with transition probability for any , for any .
We note that the chain is the stochastic complement of with respect to the set [Mey89]. Let $\mathbf{P}=\bigl(\begin{smallmatrix}\mathbf{P}_{D}&\mathbf{P}_{1}\\ \mathbf{P}_{2}&\mathbf{P}_{B}\end{smallmatrix}\bigr)$ denote the transition probability matrix underlying . We have the following lemma regarding the transition probability matrix underlying .
Lemma 5.6 ([Mey89]).
The Markov chain is irreducible and aperiodic. Furthermore, its transition probability matrix is $\mathbf{P}_{D}+\mathbf{P}_{1}(\mathbf{I}-\mathbf{P}_{B})^{-1}\mathbf{P}_{2}$.
It is known (see e.g., [Mey89] and [KPS13]) that, the stationary distribution in is given by the vector such that for any .
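To make the induced chain concrete, here is a small numerical sketch of the stochastic complement. It assumes the standard block formula P_D + P_1 (I - P_B)^{-1} P_2 for the complement with respect to D; the toy chain (a lazy walk on a 5-cycle) and the helper name `stochastic_complement` are hypothetical and used only for illustration.

```python
import numpy as np

def stochastic_complement(P, D):
    """Stochastic complement of the chain P with respect to the index set D.

    Assumes the standard block formula S = P_D + P_1 (I - P_B)^{-1} P_2,
    where B is the complement of D in the state space.
    """
    n = P.shape[0]
    D = np.asarray(D)
    B = np.setdiff1d(np.arange(n), D)
    P_D = P[np.ix_(D, D)]
    P_1 = P[np.ix_(D, B)]
    P_2 = P[np.ix_(B, D)]
    P_B = P[np.ix_(B, B)]
    # Solve (I - P_B) X = P_2 rather than forming the inverse explicitly.
    X = np.linalg.solve(np.eye(len(B)) - P_B, P_2)
    return P_D + P_1 @ X

# Toy chain: lazy random walk on a 5-cycle, with D = {0, 1, 2} and B = {3, 4}.
n = 5
A = np.zeros((n, n))
for v in range(n):
    A[v, (v + 1) % n] = A[(v + 1) % n, v] = 1.0
P = 0.5 * (np.eye(n) + A / 2.0)

S = stochastic_complement(P, [0, 1, 2])
pi_D = np.full(3, 1.0 / 3.0)  # uniform stationary distribution of P, renormalized on D
```

In this example `S` is again a stochastic matrix on D, and the uniform distribution restricted to D is stationary for it, which matches the description of the stationary distribution of the induced chain given above.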
Now let us consider a vertex and an integer that will be specified later. Let denote the distribution of a random walk of length starting from in . Consider the stopping rule that stops the walk in as soon as it has taken steps in , that is, is a stopping rule from to . Recall that denotes the expected number of steps the walk takes starting from before being terminated by the stopping rule . The following lemma has been proven in [KPS13].
Lemma 5.7 ([KPS13]).
There exists a set with such that for any , . In particular, for any such vertex , .
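The link between the stopping rule and the induced chain can be checked on a toy example: the endpoint of a walk in the original chain that is stopped after it has taken t steps in D should be distributed like a t-step walk in the induced chain. The sketch below is only an illustration under assumptions: it uses the standard stochastic-complement formula for the induced chain, a lazy walk on a 5-cycle as the original chain, and hypothetical helper names; it does not reproduce the bound of Lemma 5.7.

```python
import numpy as np

def stochastic_complement(P, D):
    """Induced chain on D (assumed formula: P_D + P_1 (I - P_B)^{-1} P_2)."""
    n = P.shape[0]
    D = np.asarray(D)
    B = np.setdiff1d(np.arange(n), D)
    X = np.linalg.solve(np.eye(len(B)) - P[np.ix_(B, B)], P[np.ix_(B, D)])
    return P[np.ix_(D, D)] + P[np.ix_(D, B)] @ X

def stopped_walk(P, in_D, start, t, rng):
    """Walk on P from `start`; stop as soon as t steps have landed in D.

    Returns the final state and the total number of steps taken in the original chain.
    """
    v, steps_in_D, total = start, 0, 0
    while steps_in_D < t:
        v = int(rng.choice(P.shape[0], p=P[v]))
        total += 1
        steps_in_D += int(in_D[v])
    return v, total

# Toy original chain: lazy walk on a 5-cycle, D = {0, 1, 2}.
n, D, t = 5, [0, 1, 2], 3
A = np.zeros((n, n))
for v in range(n):
    A[v, (v + 1) % n] = A[(v + 1) % n, v] = 1.0
P = 0.5 * (np.eye(n) + A / 2.0)
in_D = np.zeros(n, dtype=bool)
in_D[D] = True

# Exact t-step distribution of the induced chain started at vertex 0.
exact = np.linalg.matrix_power(stochastic_complement(P, D), t)[0]

# Empirical endpoint distribution of the stopped walk in the original chain.
rng = np.random.default_rng(0)
samples = [stopped_walk(P, in_D, 0, t, rng) for _ in range(5000)]
empirical = np.bincount([v for v, _ in samples], minlength=n)[D] / len(samples)
mean_length = float(np.mean([steps for _, steps in samples]))
```

Here `mean_length` is an empirical estimate of the expected length of the stopping rule from the chosen start vertex; it is at least t, and the excess reflects the time the walk spends outside D.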
Now we use the above induced chain to analyze the random walks on noisy clusterable graphs. Let be a graph with an -partition , satisfying the precondition of Theorem 1.5. We let denote the union of all ’s with , that is, and . We consider the induced Markov chain with state set .
Recall that we let denote the adjacency matrix of the -regular graph corresponding to (see Section 2). Then the transition probability matrix is . If we let $\mathbf{A}=\bigl(\begin{smallmatrix}\mathbf{A}_{D}&\mathbf{A}_{1}\\ \mathbf{A}_{2}&\mathbf{A}_{B}\end{smallmatrix}\bigr)$, then by Lemma 5.6, the transition probability matrix of is
[TABLE]
If we let denote the (weighted) -bounded graph with adjacency matrix , then by the above analysis (and the fact that [Mey89]), corresponds to the lazy random walk on the graph .
In the following, we show that is a clusterable graph with clusters , which will imply that the chain has the nice local mixing property as guaranteed by Theorem 5.1. Then we can use the stopping rules to relate the chains and .
The following lemma shows that if we construct as above for the graph that satisfies the precondition of Theorem 1.5, then is -clusterable. This is trivial for the case of (as in [KPS13]), as the inner conductance of any set is monotonically increasing. However, for general , we need to deal with the difficulty of bounding the outer conductance of potential clusters, as the outer conductance of any set is also monotonically increasing due to our construction.
Lemma 5.8.
Let be a -bounded graph with an -partition , such that . Furthermore, each can be partitioned into two subsets and such that . Let and . Let be the weighted graph corresponding to the Markov chain on constructed as above. Then in the graph , each has inner conductance at least and outer conductance at most .
Proof.
We first consider the inner conductance of in . Let with . By the fact that the adjacency matrix of is , it holds that . This implies that the inner conductance of in is at least .
To bound the outer conductance of in , we instead bound the outer conductance of in the Markov chain , which is defined to be , where denotes the transition probability from to in the Markov chain . Note that by our definitions, .
Recall that and that the transition probability matrix of is given by Equation (1). Then we have that
[TABLE]
where the last equation follows from the Neumann Series .
We bound each term on the right-hand side of the above inequality as follows. First, we have that
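As a generic reminder of the identity used here: for a square matrix M with spectral radius below 1, the inverse of I - M equals the sum of all powers of M. The following minimal sketch checks this numerically on a hypothetical 2x2 substochastic matrix; it is not the specific matrix from the proof.

```python
import numpy as np

def neumann_partial_sum(M, terms):
    """Return the partial sum I + M + M^2 + ... + M^(terms-1)."""
    n = M.shape[0]
    total = np.zeros((n, n))
    power = np.eye(n)
    for _ in range(terms):
        total += power
        power = power @ M
    return total

# Hypothetical substochastic block: every row sums to 0.75 < 1, so the series converges.
M = np.array([[0.5, 0.25],
              [0.25, 0.5]])
exact = np.linalg.inv(np.eye(2) - M)
approx = neumann_partial_sum(M, 200)
```

The i-th term of the series accounts for walks that spend exactly i consecutive steps inside the block, which is the same term-by-term path-counting view used in the bounds that follow.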
[TABLE]
Furthermore, we observe that is exactly the number of paths that start from , then go to a vertex , and then move to . Thus,
[TABLE]
Similarly, for each , is exactly the number of paths that start from , then go to a vertex , and move inside for the next steps until some vertex , and then move to . We have that
[TABLE]
where in the first inequality, the third summation is taken over all possible paths from to some vertex , such that the length of is and all vertices on belong to ; in the second inequality, we used the fact that the number of such paths is at most and each vertex has degree at most .
Thus,
[TABLE]
By the above inequalities (2),(3),(4), we obtain that
[TABLE]
where in the second to last inequality, we used the assumption that , which gives that .
Therefore, .
Now we are ready to prove Theorem 1.5.
Proof of Theorem 1.5.
Let . Let . Then it holds that , and . We consider the induced Markov chain on as above. By Lemma 5.8, the corresponding -bounded weighted graph is -clusterable. In particular, and for any .
Let be an integer that will be specified later. For any , we let be the probability distribution of an step random walk starting from in the induced Markov chain . Let be the stopping rule from to which is obtained by stopping the random walk that starts at in as soon as it has taken steps in . Let be the set guaranteed by Lemma 5.7 such that and for any ,
[TABLE]
Now we set and thus . We then apply Theorem 5.1 on with -clusters and , , to obtain that for any with , there exists a set with such that for any and , it holds that . This implies that
[TABLE]
Now we set . Then it is guaranteed that for any with , . Thus, for any , both inequalities (5) and (6) hold.
Now let us consider an arbitrary . Let and . By the precondition of the Theorem, we have that . We further recall that denotes the distribution of a uniform average walk of length with initial distribution in the original chain . By applying Lemma 5.5 with and distribution , we obtain that for any ,
[TABLE]
where denotes the distribution of an step random walk on with initial distribution , that is . (Here we slightly abuse the notation and use it to denote the distribution on by adding zero coordinates corresponding to vertices in ). This further implies that for any set and any ,
[TABLE]
Therefore,
[TABLE]
where the last inequality follows from inequality (5). Now recall that denotes the transition probability matrix of the random walk. We will show the following claim.
Claim 5.9.
For any , it holds that
Assuming that the above claim holds, we have that for any ,
[TABLE]
where the last inequality follows from Ineq. (6) and Claim 5.9. This, together with inequality (7), gives that
[TABLE]
This will then finish the proof of the theorem.
Now we give the proof of Claim 5.9.
Proof of Claim 5.9.
For notational simplicity, we let . We write , where and () denote the -th eigenvalue of , respectively. Let . Note that .
Note that
[TABLE]
which gives that . Thus, , or equivalently,
Let , where . Then we have that Thus, which gives that
[TABLE]
Now we have that
[TABLE]
where we used our choice of parameters which satisfy that and .
On the other hand, if we let denote the diagonal matrix such that if and [math] otherwise, then by Proposition 2.5 in [ST13], it holds that for any ,
[TABLE]
This gives that
[TABLE]
Finally, by the above calculations, we have that
[TABLE]
This finishes the proof of the Claim.
This finishes the proof of Theorem 1.5.
6 Conclusions
We gave the first robust clustering oracle and local filter for reconstructing the cluster structure of bounded degree graphs. Both algorithms run in sublinear time. To design and analyze our algorithms, we formalized and proved a new behavior of random walks in a noisy clusterable graph: a random walk of appropriately chosen length from a typical vertex in a large cluster of the clusterable part will mix well in the corresponding cluster, which might be of independent interest.
It is an interesting open question to design a local reconstruction algorithm that outputs a clusterable graph with a better cluster-quality guarantee, and in particular to remove the gap between the inner conductances of the original graph and the corrected graph in our current result. In the property testing setting, such a gap was successfully closed, both for testing expansion ([CS10] vs. [KS11, NS10]) and for testing -clusterability ([CPS15] vs. [CKK+18]). However, for the local reconstruction setting, we do not even know how to remove such a logarithmic gap for reconstructing noisy expander graphs (i.e., ). As noted in [KPS13], for the case , one already needs more refined definitions of strong/weak vertices and much stronger results about random walks in noisy expander graphs. Removing the logarithmic gap from our result for locally reconstructing cluster structure for general can be as hard, if not harder. A similar question can be asked about removing the gap between the inner and outer conductance of the input instance of our robust clustering oracle. As we mentioned before, there is evidence in [CKK+18] showing that this is difficult (for distribution distance based algorithms).
Acknowledgements.
We are thankful to anonymous reviewers of FOCS 2018 and STOC 2019 for valuable comments.
Appendix A Proof of Lemma 3.2
Proof of Lemma 3.2.
First, note that if there are more than vertices in satisfying that , then , which contradicts the fact that is strong with respect to .
Second, by the definition of the set and the fact that , there can be at most vertices in , and thus there are at least vertices such that . Thus
[TABLE]
where in the second inequality we used the fact that as .
Finally, since is strong with respect to , there are at least vertices such that . The same is true for . Thus, there are at least vertices such that . Again, by the fact that , we have that
[TABLE]
This finishes the proof of the Lemma.
Appendix B Description of the Algorithm EstimateRCP
In the algorithm, is a sufficiently large constant.
[TABLE]
[TABLE]
Appendix C Further Guarantees on the Locally Reconstructed Graph
In the following, we show that by sacrificing the inner conductance quality, we can also find a clustering of the reconstructed graph with small outer conductance.
Lemma C.1.
Let . If is an -perturbation of a -clusterable graph, then the resulting graph from the local reconstruction algorithm is -clusterable, for any .
Proof.
We start with the -clustering of that is guaranteed from Lemma 4.3. Let be a partition satisfying that . Let . We next carefully merge some of these clusters so that each part of the final partition will have both inner conductance at least and outer conductance at most .
If there exists such that with , then we merge and to obtain a new cluster . We repeat until the condition is violated.
Note that this process always terminates as each time the number of clusters decreases by . Furthermore, note that after termination, each cluster has outer conductance at most by construction. Now we show that in each iteration, the merged still has large inner conductance. Let with . Let and . Note that it cannot happen simultaneously that and . Now we have the following cases.
- If both and , then
[TABLE]
- If , then .
  1. If , then as otherwise and , a contradiction. Then . Thus there will be at least edges between and . Thus .
  2. If , then . Therefore, .
- If , then it must hold that .
  1. If , then . Thus .
  2. If , then . If , then , then . Otherwise, , then . Thus .
From the above analysis, we know that if both and , then after merging and , the resulting cluster has inner conductance at least . Since there will be at most iterations (or merges), we know that in the final partition , each part has outer conductance at most and inner conductance . This proves the statement of the lemma.
References
- [ABJ18] Nir Ailon, Anup Bhattacharya, and Ragesh Jaiswal. Approximate correlation clustering using same-cluster queries. In Latin American Symposium on Theoretical Informatics, pages 14–27. Springer, 2018.
- [ABJK18] Nir Ailon, Anup Bhattacharya, Ragesh Jaiswal, and Amit Kumar. Approximate clustering with same-cluster queries. In 9th Innovations in Theoretical Computer Science Conference, ITCS 2018, January 11-14, 2018, Cambridge, MA, USA, pages 40:1–40:21, 2018.
- [ACCL08] Nir Ailon, Bernard Chazelle, Seshadhri Comandur, and Ding Liu. Property-preserving data reconstruction. Algorithmica, 51(2):160–182, 2008.
- [ACL06] Reid Andersen, Fan Chung, and Kevin Lang. Local graph partitioning using pagerank vectors. In 47th Annual IEEE Symposium on Foundations of Computer Science (FOCS’06), pages 475–486. IEEE, 2006.
- [AKBD16] Hassan Ashtiani, Shrinu Kushagra, and Shai Ben-David. Clustering with same-cluster queries. In Advances in neural information processing systems, pages 3216–3224, 2016.
- [AOPT16] Reid Andersen, Shayan Oveis Gharan, Yuval Peres, and Luca Trevisan. Almost optimal local graph clustering using evolving sets. Journal of the ACM (JACM), 63(2):15, 2016.
- [AP09] Reid Andersen and Yuval Peres. Finding sparse cuts locally using evolving sets. In Proceedings of the forty-first annual ACM symposium on Theory of computing, pages 235–244. ACM, 2009.
- [ARVX12] Noga Alon, Ronitt Rubinfeld, Shai Vardi, and Ning Xie. Space-efficient local computation algorithms. In Proceedings of the twenty-third annual ACM-SIAM symposium on Discrete Algorithms, pages 1132–1139. Society for Industrial and Applied Mathematics, 2012.
