Multi-way sparsest cut problem on trees with a control on the number of parts and outliers
Ramin Javadi, Saleh Ashkboos

TL;DR
This paper introduces a polynomial-time dynamic programming algorithm for a generalized multi-way sparsest cut problem on trees, balancing cluster edge expansion and outlier control.
Contribution
It extends the sparsest cut problem to multiple clusters with outlier constraints on trees, providing an efficient solution despite NP-hardness in general.
Findings
Polynomial-time algorithm for weighted trees with connected subgraph restriction
Algorithm runs in O(k^2 n^3) worst case complexity
Linear time complexity when clusters and outliers are bounded by constants
Abstract
Given a graph, the sparsest cut problem asks for a subset of vertices whose edge expansion (the normalized cut given by the subset) is minimized. In this paper, we study a generalization of this problem that seeks disjoint subsets of vertices (clusters), all of whose edge expansions are small, such that the number of vertices remaining outside the subsets (outliers) is also small. We prove that although this problem is NP-hard for trees, it can be solved in polynomial time for all weighted trees, provided that we restrict the search space to subsets which induce connected subgraphs. The proposed algorithm is based on dynamic programming and runs in the worst case in $O(k^2 n^3)$, where $n$ is the number of vertices and $k$ is the number of clusters. It also runs in linear time when the number of clusters and the number of outliers are bounded by a constant.
Ramin Javadi (corresponding author), Department of Mathematical Sciences, Isfahan University of Technology, P.O. Box: 84156-83111, Isfahan, Iran; School of Mathematics, Institute for Research in Fundamental Sciences (IPM), P.O. Box: 19395-5746, Tehran, Iran. Email Address: [email protected]. This research was in part supported by a grant from IPM (No. …).
Saleh Ashkboos Department of Computer Engineering, Isfahan University of Technology, P.O. Box: 84156-83111, Isfahan, Iran. Email Address: [email protected].
- Key words: sparsest cut problem, isoperimetric number, Cheeger constant, normalized cut, graph partitioning, computational complexity, weighted trees.
- Subject classification: 05C85, 68Q25, 68R10.
1 Introduction
Data clustering is definitely among the main topics of modern computer science with an indispensable role in data mining, image and signal processing, network and data analysis, and data summarization (e.g. see [13] and references therein). Considering the current status of data science, one may name some fundamental challenges in this field, among many others, as follows:
- Clustering huge and usually high-dimensional data.
- Clustering in presence of outliers and anomalies.
- Clustering non-geometric (usually non-Euclidean) data.
- Clustering with no prior information about the number of clusters or other features of data (such as a model of the source).
Needless to say, in each case, efficiency and time-complexity of the proposed algorithms are global parameters with a decisive role in applicability.
The subject of this article falls into the setup of clustering in an unsupervised and static graph-based data presentation. It is instructive to note that the graph-based approach essentially provides data presentation in a very general (not necessarily Euclidean) setting in terms of similarity kernels. In this respect, one of the main well-studied criteria is the “sparsest cut problem” which apart from tremendous real-world applications in the context of spectral clustering (see e.g. [21, 19]), has played a crucial role in the development of many subjects in theoretical computer science (see e.g. [22, 7]).
Our main objective in this article is to improve this approach, which is essentially based on solving a suitable subpartitioning problem on a corresponding minimum spanning tree, by providing an algorithm that not only gives rise to a fast clustering procedure, but also provides good control on determining the number of clusters and outliers. The procedure is based on a dynamic program which runs in the worst case in $O(k^2 n^3)$, where $n$ is the data size and $k$ is the number of clusters. Also, the algorithm runs in linear time in terms of the data size when the number of clusters and the upper bound on the number of outliers are both constant (which is the case in the most prevalent applications). To the best of our knowledge, the partitioning problem solved by the proposed algorithm (Algorithm 3) is among the most challenging problems in this literature that are efficiently solvable, and we will also dwell on some important consequences in what follows.
1.1 A formal setup and the main result
Partitioning problems are essentially as old as graph theory itself, with wide applications in science and technology. In particular, one may refer to the unnormalized partitioning problems, usually considered as different versions of minimum cut problems, as well as the normalized versions, which are more plausible in real applications but much harder to resolve. One of the main problems in the category of normalized cut criteria is the sparsest cut problem, which is defined as follows. Given a graph $G=(V,E)$, the sparsest cut problem asks for a cut (a subset of vertices) which has the minimum edge expansion, i.e.
$$\phi(G) := \min_{\emptyset \neq S \subsetneq V} \frac{|E(S, S^{c})|}{\min\{|S|, |S^{c}|\}}, \tag{1}$$
where $S^{c} := V \setminus S$ and $E(S, S^{c})$ is the set of all edges with exactly one end in $S$. The sparsest cut problem is known to be an NP-hard problem on general graphs [21, 18]. Efforts to find an efficient algorithm for a good approximation of this problem have triggered the development of many subfields of computer science and have had a significant influence on algorithm design and complexity theory. Recent advances in computer science have given rise to a culmination of ideas not only from the classical graph theoretic point of view but also from the more geometric point of view discussed in the theory of Riemannian manifolds and stochastic processes [23]. Up to now, the best known approximation result for the sparsest cut problem is due to Arora, Rao, and Vazirani [3], which gives an $O(\sqrt{\log n})$-approximation algorithm.
It is also worth noting that the invariant defined in (1) has an intimate connection with the second eigenvalue of the associated Laplacian operator. In fact, relaxation of the minimization problem in (1) to the Euclidean norm for real functions (i.e. changing the edge expansion to the Euclidean $\ell_2$-norm of the gradient of real functions, which is the energy representable by the Laplacian operator) gives rise to an eigenvalue problem which is efficiently solvable, while estimating the approximation ratio of this relaxation has led to some fundamental contributions (e.g. see [1, 2]). These relations, known as Cheeger’s inequalities, also exert considerable influence on the construction of expander graphs as well as the study of the mixing time of Markov chains (see e.g. [15, 14]). In general, although the motivating problems in these fields of study are usually different, the synergistic effect of methods and techniques has flourished into one of the most active and productive topics in mathematics and computer science.
Recently, some generalizations of the sparsest cut problem have been studied in the literature. Here, we study a generalization which extends two-way partitioning into $k$-way connected subpartitioning and allows some vertices to lie outside the parts.
To formulate the problem precisely, let us first fix our notation and terminology. We assume that the data is given as a simple and finite weighted graph $G=(V,E,\omega,c)$ in which $\omega$ and $c$ are the vertex and edge weight functions, respectively. Note that in the literature close to applications the function $c$ is sometimes referred to as the kernel or the similarity, while from a geometric point of view the graph can also be considered as a discrete metric-measure space, where the distance function is usually chosen to be proportional to some inverse function of $c$. In this setting, by an unweighted graph we mean a graph in which all the vertex and edge weights are equal to $1$.
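Given a graph $G=(V,E,\omega,c)$ and a subset of vertices $A \subseteq V$, the edge expansion or the conductance of $A$ is defined as
$$\phi_G(A) := \frac{c(\partial A)}{\omega(A)},$$
where
$$\partial A := E(A, A^{c}), \qquad c(\partial A) := \sum_{e \in \partial A} c(e), \qquad \omega(A) := \sum_{v \in A} \omega(v).$$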
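As a concrete illustration of the conductance just defined, here is a minimal Python sketch. It assumes the usual reading of the definition (total weight of the edges leaving the subset, divided by the total vertex weight of the subset); the function name `conductance` and the dictionary-based graph encoding are illustrative choices, not notation from the paper.

```python
from fractions import Fraction

def conductance(vertex_weight, edge_weight, subset):
    """Edge expansion of `subset`: boundary edge weight over vertex weight."""
    subset = set(subset)
    if not subset:
        raise ValueError("subset must be nonempty")
    # An edge is on the boundary when exactly one endpoint lies in the subset.
    boundary = sum(
        (Fraction(c) for (u, v), c in edge_weight.items()
         if (u in subset) != (v in subset)),
        Fraction(0),
    )
    volume = sum((Fraction(vertex_weight[v]) for v in subset), Fraction(0))
    return boundary / volume

# Path 0 - 1 - 2 - 3 with unit vertex weights (hypothetical example data).
vertex_weight = {0: 1, 1: 1, 2: 1, 3: 1}
edge_weight = {(0, 1): 3, (1, 2): 1, (2, 3): 3}
print(conductance(vertex_weight, edge_weight, {0, 1}))  # 1/2
print(conductance(vertex_weight, edge_weight, {1}))     # 4
```

Exact rational arithmetic is used so that comparisons against a rational threshold are not affected by rounding.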
From a geometric point of view, the conductance can be interpreted as a normalized norm of a gradient function or a normalized energy (e.g. see [5, 6] for more on the geometric interpretations). The set $\mathcal{D}_k(V)$ is defined to be the set of all $k$-subpartitions $\{A_1,\ldots,A_k\}$ of $V$, in which the $A_i$’s are nonempty disjoint subsets of $V$. The residue of a subpartition $\{A_1,\ldots,A_k\}$ is defined to be the set $A^{*} := V \setminus \bigcup_{i=1}^{k} A_i$. The set of all $k$-partitions of $V$, which is denoted by $\mathcal{P}_k(V)$, is the subclass of $\mathcal{D}_k(V)$ containing all $k$-subpartitions for which $A^{*} = \emptyset$ (i.e. $\bigcup_{i=1}^{k} A_i = V$). A subpartition (or a partition in particular) is said to be connected if the subgraph induced on each of its parts is a connected subgraph of $G$. A generalization of the sparsest cut problem can be formulated as follows.
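Definition 1.
Given a weighted graph $G=(V,E,\omega,c)$ and a positive integer $k$, $1 \le k \le |V|$, the $k$th isoperimetric number is defined as,
$$\iota_k(G) := \min_{\{A_i\}_1^k \in \mathcal{D}_k(V)} \ \max_{1 \le i \le k} \phi_G(A_i).$$
Furthermore, considering the partitions, the $k$th minimum normalized cut number is defined as,
$$\tilde{\iota}_k(G) := \min_{\{A_i\}_1^k \in \mathcal{P}_k(V)} \ \max_{1 \le i \le k} \phi_G(A_i).$$
A vertex $v$ is called a $k$-outlier, if there exists a minimizing subpartition achieving $\iota_k(G)$, while $v$ lies in its residue. It is well-known that $\iota_2(G) = \tilde{\iota}_2(G)$ (see [8]) and the common value is usually called the Cheeger constant or edge expansion in the literature.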
In this regard, Louis et al. in [17] provide a polynomial time approximation algorithm which outputs a $(1-\epsilon)k$-partition of the vertex set such that each piece has expansion at most $O_{\epsilon}(\sqrt{\log n \log k})$ times $\tilde{\iota}_k(G)$ (for every positive number $\epsilon$). Also, in [16], higher-order Cheeger’s inequalities have been proved which relate the above parameters to the eigenvalues of the associated Laplacian matrix (see also [8, 10]).
Prior to formulating our problem, let us discuss some facts. First, one may note that as an imprecise rule of thumb, changing the cost function of a partitioning problem from the normalized form to the unnormalized form, from partitions to subpartitions, or from the mean (i.e. $\ell_1$-norm) to the max (i.e. $\ell_\infty$-norm) generally makes the problem more tractable, in the sense that finding more efficient algorithms to solve the problem becomes more probable. One of our major observations in this article is the fact that the restriction of the search space to “connected” subpartitions reduces the complexity of the problem too. In particular, this distinction is most evident when the graph is a tree, where the restriction on subpartitions to be connected reduces the complexity of the problem from NP-hard to polynomial time. Also, note that this restriction works to our advantage in the sense that a cluster is more likely to be represented by a connected subgraph than a disconnected one (based on intra-similarity of the objects within a cluster). Hence, as far as clustering is concerned, this can be considered as an acceptable assumption. As a matter of fact, in what follows, we show that such a change for the better gives rise to an efficient algorithm for clustering with a control on the number of parts and outliers.
We denote the main problem, i.e. the multi-way sparsest cut problem with a control on the residue number, by the acronym “MSC problem”, which is defined as follows.
MSC Problem.
INSTANCE:
A weighted graph $G=(V,E,\omega,c)$, nonnegative integers $k$ and $m$ and a positive rational number $\xi$.
QUERY:
Does there exist a $k$-subpartition $\{A_i\}_1^k$ of $V$ such that $\max_{1 \le i \le k} \phi_G(A_i) \le \xi$ and its residue number is at most $m$, i.e. $|A^{*}| \le m$?
The MSC problem is known to be a hard problem even when the graph is of its simplest form, i.e. a tree. When the graph is a tree, it is proved in [9] that MSC problem is NP-complete even when the tree is unweighted and is constant (e.g. ). Nonetheless, it is shown there that the problem is solvable in linear time for weighted trees when we drop the restriction on the residue number (i.e. ). An improvement of this result has effectively been applied to real clustering problems for large data-sets [11].
The main contribution of this article (Algorithm 3) is to show that although MSC problem is NP-complete for trees, it becomes tractable when the search space is restricted to connected subpartitions. In other words, the following problem, abbreviated by CMSC, can be solved in polynomial time for weighted trees.
CMSC Problem.
INSTANCE:
A weighted graph $G=(V,E,\omega,c)$, nonnegative integers $k$ and $m$ and a positive rational number $\xi$.
QUERY:
Does there exist a connected $k$-subpartition $\{A_i\}_1^k$ of $V$ such that $\max_{1 \le i \le k} \phi_G(A_i) \le \xi$ and its residue number is at most $m$, i.e. $|A^{*}| \le m$?
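To make the CMSC query concrete, the following is a brute-force reference check in Python for very small instances. It is only a sketch for building intuition and is not the paper's algorithm: it enumerates every labeling of the vertices, so it takes exponential time. The names `cmsc_brute_force`, `k` (number of parts), `m` (bound on the number of outlier vertices) and `xi` (bound on the edge expansion) are assumptions made for this example, as is the reading of the residue number as a vertex count.

```python
from fractions import Fraction
from itertools import product

def is_connected(part, adjacency):
    """Check that `part` induces a connected subgraph (depth-first search)."""
    part = set(part)
    start = next(iter(part))
    seen, stack = {start}, [start]
    while stack:
        u = stack.pop()
        for w in adjacency[u]:
            if w in part and w not in seen:
                seen.add(w)
                stack.append(w)
    return seen == part

def cmsc_brute_force(vertex_weight, edge_weight, k, m, xi):
    """Exhaustively decide a CMSC instance; exponential, tiny inputs only."""
    vertices = sorted(vertex_weight)
    adjacency = {v: set() for v in vertices}
    for u, v in edge_weight:
        adjacency[u].add(v)
        adjacency[v].add(u)
    # Label 0 marks the residue; labels 1..k mark the parts.
    for labels in product(range(k + 1), repeat=len(vertices)):
        parts = [{v for v, lab in zip(vertices, labels) if lab == i}
                 for i in range(1, k + 1)]
        if labels.count(0) > m or any(not p for p in parts):
            continue
        ok = True
        for p in parts:
            cut = sum(Fraction(c) for (u, v), c in edge_weight.items()
                      if (u in p) != (v in p))
            weight = sum(Fraction(vertex_weight[v]) for v in p)
            if not is_connected(p, adjacency) or cut / weight > xi:
                ok = False
                break
        if ok:
            return True
    return False

# Path 0 - 1 - 2 - 3 with unit vertex weights (hypothetical example data).
vertex_weight = {0: 1, 1: 1, 2: 1, 3: 1}
edge_weight = {(0, 1): 3, (1, 2): 1, (2, 3): 3}
print(cmsc_brute_force(vertex_weight, edge_weight, 2, 0, Fraction(1, 2)))  # True
print(cmsc_brute_force(vertex_weight, edge_weight, 2, 1, Fraction(2, 5)))  # False
```

In the first call the parts {0, 1} and {2, 3} both have edge expansion 1/2; in the second, no choice of two connected parts with at most one outlier stays below 2/5.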
This result along with the fact that the minimum spanning tree of a geometric metric-measure space inherits a large part of the geometry of the space, can lead to a good approximation for MSC problem for general graphs. This can justify the importance of the problem on weighted trees when applications are concerned. Let us consider some consequences of this result.
Firstly, note that given a weighted tree $T$ and integers $k$ and $m$, finding the minimum number $\xi$ for which there exists a connected $k$-subpartition $\{A_i\}_1^k$ with residue number at most $m$ and $\max_{1 \le i \le k} \phi_T(A_i) \le \xi$ (as well as finding the minimizing subpartition) can be done in polynomial time by applying our algorithm iteratively along with a simple binary search.
Secondly, given a weighted tree and numbers (the worst edge expansion of the clusters), we can obtain a number , denoting the maximum number of parts for which the answer to CMSC problem is positive. This by itself is an important piece of information when one considers the large existing literature discussing how to determine the number of clusters for a clustering algorithm (e.g. see [20] for $k$-means).
Thirdly, from another point of view, CMSC problem can be considered as a problem of outlier-robust clustering where a solution will provide information on the number of outliers. It is well-known that detection of outliers and anomalies in data-sets is among the most challenging problems in the field, not just because of the hardness of the problem itself, but because the concepts themselves are quite fuzzy and depend on many different parameters such as scaling or the distribution of the source (e.g. see [4, 12] for the background). These facts, and in particular the lack of a universal, sound and precise definition, are among the first obstacles when one is dealing with these kinds of problems. In [11], some evidence is discussed as to how the data remaining in the exterior of the clusters in MSC problem can be justified to be actual outliers in some sense.
Finally, our method can be extended to handle some more general semi-supervised settings where a number of training samples are given by the user which are forced or forbidden to lie in outliers (see Section 4).
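The binary search mentioned in the first consequence above can be sketched as follows. This is a generic numerical bisection under stated assumptions, not the paper's procedure: `decide` is a hypothetical callable standing in for a CMSC decision routine with the tree and the two integer parameters fixed, it is assumed to be monotone in the threshold, and the answer is only located up to the tolerance `tol`.

```python
def min_feasible_threshold(decide, low, high, tol=1e-9):
    """Bisect for the smallest threshold in [low, high] accepted by `decide`.

    Assumes `decide` is monotone: False below some value, True from it on.
    Returns None when even `high` is rejected.
    """
    if not decide(high):
        return None
    if decide(low):
        return low
    # Invariant: decide(low) is False and decide(high) is True.
    while high - low > tol:
        mid = (low + high) / 2
        if decide(mid):
            high = mid
        else:
            low = mid
    return high

# Hypothetical oracle whose true optimum is 0.5.
best = min_feasible_threshold(lambda xi: xi >= 0.5, 0.0, 4.0)
print(abs(best - 0.5) < 1e-8)  # True
```

Since every attainable edge expansion is a ratio of a sum of edge weights to a sum of vertex weights, an exact variant could instead bisect over a sorted list of candidate ratios; the numerical version is shown only because it is shorter.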
The organization of forthcoming sections is as follows. In Section 2, we give required definitions and notations as well as the lemmas which justify our algorithm. In Section 3, we present the main algorithm and explain how it can find the optimal subpartition. We also compute the time complexity of our algorithm. Finally, in Section 4, we discuss some extensions which handle more realistic models.
2 Preliminaries
Let be a rooted tree with root . There is a natural partial order induced through the root on the vertices and edges of defined as for two vertices and whenever there is a path in starting from and ending at which contains . Similarly, for two edges and whenever there is a path in starting from and containing and such that is closer than to on . In this setting, note that for any there exists a unique minimal vertex , with and an edge , where and are called the parent vertex and the parent edge of , respectively (and also is called the child of ). Also, for a given edge with we may refer to and , intermittently. For some technical reasons, we add one new vertex to and connect it to and define the parent edge of , , as the edge . Also, we set .
If is a subset of edges of , then is the set of maximal elements of with respect to the natural partial order of . Given a vertex with the parent edge , the subtree refers to the subtree induced on the set . Therefore, .
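The rooted-tree bookkeeping described above (parents, an order that visits children before parents, and the total vertex weight of each subtree) can be precomputed in linear time. The sketch below is one way to do it; the function name `root_tree` and the adjacency-list encoding are assumptions for illustration.

```python
from collections import deque

def root_tree(adjacency, vertex_weight, root):
    """Return (order, parent, subtree_weight) for the tree rooted at `root`.

    `order` lists the vertices in BFS order from the root, `parent[root]` is
    None, and `subtree_weight[v]` sums the weights of v and its descendants.
    """
    parent = {root: None}
    order = []
    queue = deque([root])
    while queue:
        u = queue.popleft()
        order.append(u)
        for w in adjacency[u]:
            if w not in parent:
                parent[w] = u
                queue.append(w)
    subtree_weight = dict(vertex_weight)
    # Reverse BFS order visits every child before its parent.
    for u in reversed(order):
        if parent[u] is not None:
            subtree_weight[parent[u]] += subtree_weight[u]
    return order, parent, subtree_weight

# Hypothetical example: root 0 with children 1 and 2; vertex 3 hangs below 1.
adjacency = {0: [1, 2], 1: [0, 3], 2: [0], 3: [1]}
vertex_weight = {0: 1, 1: 2, 2: 3, 3: 4}
order, parent, subtree_weight = root_tree(adjacency, vertex_weight, 0)
print(order)           # [0, 1, 2, 3]
print(subtree_weight)  # {0: 10, 1: 6, 2: 3, 3: 4}
```

Processing `reversed(order)` is the bottom-up scan toward the root that a dynamic program over subtrees needs.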
Let be a weighted tree and be a fixed positive number. For every integer , define to be the class of all -subpartitions such that for each , and the induced subgraph of on is connected (i.e. is a subtree of ). Also, given a subpartition , its residue set is defined as . We also define,
[TABLE]
In the following we describe the idea that our algorithm is based on and also prove the correctness of the algorithm. First, note that since we are looking for subsets with small edge expansion, when we cut an edge , the subset containing sustains a loss in its edge expansion. The cause of this deficiency is that the numerator of the edge expansion is added by and the denominator is subtracted by . With this intuition, for every edge , define
[TABLE]
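The intuition above can be made explicit with a short calculation. The notation here is assumed for illustration and may differ from the paper's: $A$ is a connected part whose topmost vertex is $v$, $e_v$ is the parent edge of $v$, $T_v$ is the subtree hanging below $v$, $F$ is the set of topmost edges cut below $v$, $T_e$ is the subtree hanging below a cut edge $e$, and $\xi$ is the bound on the edge expansion. Vertex weights are assumed positive.

```latex
\begin{align*}
A &= V(T_v) \setminus \bigcup_{e \in F} V(T_e), \\
c(\partial A) &= c(e_v) + \sum_{e \in F} c(e), \qquad
\omega(A) = \omega(T_v) - \sum_{e \in F} \omega(T_e), \\
\frac{c(\partial A)}{\omega(A)} \le \xi
&\iff \sum_{e \in F} \bigl( c(e) + \xi\,\omega(T_e) \bigr)
      \le \xi\,\omega(T_v) - c(e_v).
\end{align*}
```

Under these assumptions each cut edge contributes an additive penalty $c(e) + \xi\,\omega(T_e)$, so a part satisfies the bound exactly when the total penalty of its cut edges stays below a threshold that depends only on $v$. This additivity is what allows a bottom-up dynamic program to minimize the penalty subtree by subtree.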
Now, let and be two nonnegative integers and for every integers , and vertex , define to be the set of all -subpartitions in such that and and for each , we have . For each such subpartition , let . Note that any pair of edges in are incomparable and define,
[TABLE]
We will shortly see that minimizing the edge expansion , in some sense, is equivalent to minimizing (see (6)). Thus, define,
[TABLE]
On the other hand, for every integers and and vertex , define to be equal to if there exists a connected -subpartition such that and and it is equal to [math], otherwise. Note that, although ’s are subsets of , is computed in the whole tree . Also, note that for every vertex and integer , we have
[TABLE]
In fact, our main goal is to compute the parameter , since evidently the answer to CMSC problem is yes if and only if . In the sequel, we are going to show that the parameters and can be computed recursively in a breadth-first scanning of vertices towards the root. First, in the following, we explain how one can compute recursively in terms of the values , . For this, let be fixed and given a vertex , let be an ordering of all of its children. Now, for every integers , , define
[TABLE]
and for every , define
[TABLE]
In the following lemma, we show how one can use the recursion in (5) to compute the function .
Lemma 2.
Let be a vertex in a rooted tree , be a number and be two integers. Also, let be the children of in . For every integers , , if and only if either , or .
Proof.
Suppose that and let be a connected subpartition where and . First, assume that . Thus, itself can be partitioned into connected subpartitions such that , for some integers , where . Also, let . Therefore, by definition and . Thus, again by definition . Next, suppose that and so, without loss of generality, assume that . Then,
[TABLE]
Therefore, . This implies that if , then either , or .
Now, suppose that . Then, there exist integers and such that , and , for all . Thus, for each , there exists such that and . Define . Thus, and . Hence, .
Finally, suppose that . Also, let be a minimizer with . Then, by definition, for every , and and by (6), . Hence, . This completes the proof. ∎
As we see in Lemma 2, in order to obtain the value of , we require to have the value of . In the next step, we show that given and , how one may compute efficiently for all vertices and integers , . For this, let be fixed and given a vertex , let be an ordering of all of its children. Now, for every integers , and , define
[TABLE]
Also, define
[TABLE]
and for every , define
[TABLE]
The following lemma shows how to compute the function using recursion (9).
Lemma 3.
Let be a rooted tree, be a number and be two integers. Then, for every vertex with children and every integers and , we have
[TABLE]
Proof.
We prove the lemma by induction on the number . Let be a -subpartition. First, suppose that . If , then and , so . Also, if , then and . Therefore, as in (7) and (8).
Now, suppose that . Let and and , and be the values of for the trees , and , respectively. Also, let , and let (resp. ) be the number of sets which intersect (resp. ). Then, evidently we have (note that intersects both and ) and . Therefore,
[TABLE]
On the other hand, by the induction hypothesis, we have and . Hence, by (9), we have and we are done. ∎
3 The algorithm
In this section, using Lemmas 2 and 3, we provide an algorithm to solve the CMSC problem for all weighted trees. The core of the algorithm consists of two dynamic programs. The final solution to the problem is given in Algorithm 3, which scans the vertices in a BFS order towards the root and computes recursively the values of and , for and . The structure of Algorithm 3, which deploys Algorithms 1 and 2 as two subroutines, is as follows.
First, for all leaves (vertices with no children), it computes the values of and (Lines 8-13 in Algorithm 3). Next, for a vertex , with children , according to Lemma 3, it applies a dynamic programming (Algorithm 1) based on the recursion given in Equations (8) and (9), to obtain the value of , assuming the values of and are given. Finally, according to Lemma 2, it applies another dynamic programming (Algorithm 2) based on the recursion given in (5) to obtain the value of , assuming the values of and are given. The backtracking ends up outputting the value of which is equal to if and only if there exists a connected -subpartition with and . This completes the solution.
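The overall bottom-up scheme can be sketched in Python as below. This is an independent, simplified illustration of a two-table dynamic program on a rooted tree, not a transcription of Algorithms 1 to 3, and it makes no attempt to match their running time. All names are assumptions: `k` is the number of parts, `m` bounds the number of outlier vertices, `xi` bounds the edge expansion, `feasible` records which (parts, outliers) pairs a subtree can realize, and `open_cost` records the least total penalty of cut edges when the subtree's top vertex sits in a part that may still grow upward. Vertex weights are assumed positive, and the root is treated as having a parent edge of weight zero.

```python
from fractions import Fraction
import math

def cmsc_tree_dp(adjacency, vertex_weight, edge_weight, root, k, m, xi):
    """Decide a CMSC instance on a weighted tree (illustrative sketch)."""
    xi = Fraction(xi)
    cost = {}
    for (u, v), c in edge_weight.items():
        cost[(u, v)] = cost[(v, u)] = Fraction(c)
    # Depth-first preorder; its reverse visits every child before its parent.
    parent, order, stack = {root: None}, [], [root]
    while stack:
        u = stack.pop()
        order.append(u)
        for w in adjacency[u]:
            if w not in parent:
                parent[w] = u
                stack.append(w)
    children = {v: [] for v in order}
    for v in order[1:]:
        children[parent[v]].append(v)

    def blank(fill):
        return [[fill] * (m + 1) for _ in range(k + 1)]

    weight, feasible, open_cost = {}, {}, {}
    for v in reversed(order):
        weight[v] = Fraction(vertex_weight[v]) + sum(weight[u] for u in children[v])
        # res[l][j]: v is an outlier, with l valid parts and j outliers below v.
        # opn[l][j]: least penalty when v lies in a part that is still open,
        # with l parts in the subtree (open one included) and j outliers.
        res, opn = blank(False), blank(math.inf)
        res[0][0] = True
        if k >= 1:
            opn[1][0] = Fraction(0)
        for u in children[v]:
            new_res, new_opn = blank(False), blank(math.inf)
            penalty = cost[(v, u)] + xi * weight[u]
            for l1 in range(k + 1):
                for j1 in range(m + 1):
                    for l2 in range(k + 1):
                        for j2 in range(m + 1 - j1):
                            j = j1 + j2
                            if feasible[u][l2][j2] and l1 + l2 <= k:
                                # Edge (v, u) is cut: u's subtree stands alone.
                                new_res[l1 + l2][j] |= res[l1][j1]
                                new_opn[l1 + l2][j] = min(
                                    new_opn[l1 + l2][j], opn[l1][j1] + penalty)
                            if l1 >= 1 and l2 >= 1 and l1 + l2 - 1 <= k:
                                # Edge (v, u) is kept: u's open part joins v's.
                                new_opn[l1 + l2 - 1][j] = min(
                                    new_opn[l1 + l2 - 1][j],
                                    opn[l1][j1] + open_cost[u][l2][j2])
            res, opn = new_res, new_opn
        # Closing v's part is valid iff the penalty fits under the threshold.
        parent_cost = cost[(parent[v], v)] if parent[v] is not None else 0
        threshold = xi * weight[v] - parent_cost
        feasible[v] = [[(j >= 1 and res[l][j - 1]) or opn[l][j] <= threshold
                        for j in range(m + 1)] for l in range(k + 1)]
        open_cost[v] = opn
    return any(feasible[root][k])

# Path 0 - 1 - 2 - 3 with unit vertex weights (hypothetical example data).
adjacency = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
vertex_weight = {0: 1, 1: 1, 2: 1, 3: 1}
edge_weight = {(0, 1): 3, (1, 2): 1, (2, 3): 3}
print(cmsc_tree_dp(adjacency, vertex_weight, edge_weight, 0, 2, 0, Fraction(1, 2)))  # True
print(cmsc_tree_dp(adjacency, vertex_weight, edge_weight, 0, 2, 1, Fraction(2, 5)))  # False
```

The sketch keeps the two roles separate in the same spirit as the text: one table answers the yes/no question for a subtree, and the other carries the minimum penalty needed to decide whether the part containing the current vertex can be closed.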
3.1 Time complexity
The time complexity of the provided algorithms can be computed as follows. In Algorithm 1, Lines 2-12 can be done in . Also, Lines 14-25 can be performed in . In Algorithm 2, Lines 2-9 run in and Lines 11-26 run in . Hence, the runtime of Algorithm 3 is in . Since in real applications, the values of and are mostly much smaller than , we can assume that the algorithm runs in linear time with respect to the number of nodes.
3.2 Constructing the optimal subpartition
We now show how, during the execution of Algorithm 3, one can construct a subpartition with and (if one exists). Let and be fixed and for every vertex and and , if , then let be a subpartition in such that and . Also, if , let . Then, the subpartition is what we are looking for. Also, let be a subpartition in which minimizes (3).
Now, let be a vertex with children . First, according to Algorithm 2 and assuming that we have all the subpartitions and , we explain how to obtain . For this, throughout the execution of Algorithm 2, in Line 5, if , then set , otherwise set . Also, in Line 10, if , then set and in Line 18, if , then set . Finally, in Line 27, if , then set .
Next, according to Algorithm 1 and assuming that we have all the subpartitions and , we explain how to obtain . First, throughout the execution of Algorithm 1, in Line 5, if and , then set , otherwise let be the subpartition obtained from by adding the vertex to the set containing . Also, in Line 13, set . Next, in Line 20, if , then let be obtained from the disjoint union of and by merging two sets containing the vertex . Finally, in Line 26, set .
4 Towards more extensions
In this section, we show that our presented scheme can be generalized to solve the following more realistic problems efficiently:
1. Solving CMSC problem on trees with potentials.
2. Solving CMSC problem on forests.
3. Solving the following semi-supervised problem: Given a weighted graph (not necessarily a forest), two disjoint subsets , rational number and integers , such that the induced subgraph of on is a forest. Does there exist a connected subpartition such that , , and ?
In the following, we elaborate on the modifications that should be made to tackle the above settings.
1. In the setting of trees with potentials, each vertex is endowed with a potential weight, say , which is a nonnegative number and the goal is to determine whether there exists a connected -subpartition such that
[TABLE]
and . We can extend our method to solve this problem using Algorithm 3. First, for each edge , amend the definition of in (2) as follows
[TABLE]
Also, define the functions and analogously. Next, with a similar argument as in Lemma 2, one may prove that if and only if either , or . Moreover, Lemma 3 is still valid. So, we should just change Line 1 in Algorithm 1 and Line 5 in Algorithm 2 accordingly, and then Algorithm 3 works for the new setting.
2. Suppose that the forest consists of disjoint trees rooted at respectively. Also, let be fixed. First, using Algorithm 3, compute the value of , for every integers , and . Also, define
[TABLE]
The following recursion helps us to solve the problem on . For every , define
[TABLE]
Then, the answer to CMSC problem is yes if and only if . Furthermore, one may easily extend this recursion to solve the corresponding problem on forests with potentials.
3. In this setting, some vertices must be, or must not be, in the residue set. The problem can be solved in the following steps:
- First, for each vertex , define a potential as follows
[TABLE]
- Now, let be a forest obtained from by deleting all vertices in . Also, let .
- If is empty, then the solution can be obtained by performing the method given in 2 on the forest with the potential weight and the numbers . If is non-empty, we have to make the following additional modifications to handle the problem.
Suppose that is a tree and is a subset of vertices. Also, numbers are given. We are looking for a connected subpartition such that , and . Note that Lemma 3 is still valid in this setting. However, in the computation of , for each , in Lemma 2, if , then is not allowed to be in the residue set. So, the value of is equal to if and only if . Thus, with a similar proof as in Lemma 2, we can prove that
[TABLE]
Then, Algorithm 2 can be modified accordingly to compute the value of .
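For the forest setting in item 2 above, the per-tree answers can be combined by a knapsack-style recursion over the trees. The sketch below shows one such combination under assumptions of its own: `tables[i][l][j]` is taken to be True when tree `i` admits a valid connected subpartition with exactly `l` parts and exactly `j` outliers, each table has `k + 1` rows and `m + 1` columns, and the function name `combine_forest` is hypothetical. The recursion in the text may be organized differently.

```python
def combine_forest(tables, k, m):
    """Return True if the trees can jointly supply k parts and <= m outliers."""
    # reach[l][j]: the trees processed so far can supply l parts and j outliers.
    reach = [[False] * (m + 1) for _ in range(k + 1)]
    reach[0][0] = True
    for table in tables:
        nxt = [[False] * (m + 1) for _ in range(k + 1)]
        for l1 in range(k + 1):
            for j1 in range(m + 1):
                if not reach[l1][j1]:
                    continue
                for l2 in range(k + 1 - l1):
                    for j2 in range(m + 1 - j1):
                        if table[l2][j2]:
                            nxt[l1 + l2][j1 + j2] = True
        reach = nxt
    return any(reach[k])

# Hypothetical tables for k = 2, m = 1 (rows: parts 0..2, columns: outliers 0..1).
tree_a = [[False, False], [True, False], [False, False]]  # only (1 part, 0 outliers)
tree_b = [[False, True], [True, False], [False, False]]   # (0, 1) or (1, 0)
print(combine_forest([tree_a, tree_b], 2, 1))  # True
```

Every tree must contribute some entry of its table, which reflects that each vertex of the forest is either covered by a part or counted as an outlier.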
5 Concluding remarks and future work
In this paper, a multi-way sparsest cut problem has been investigated for weighted trees and it was shown that although the problem is NP-complete for trees, it becomes tractable when the search space is confined to connected subdomains. One of the strengths of the method is that it has a control on the number of outliers and can manage semi-supervised settings when some data points are forced or forbidden to be outliers. Besides the theoretical importance of the sparsest cut problem, when our method is applied to the minimum spanning tree, it can steer several applications in both unsupervised and semi-supervised clustering.
One may also consider an analogous problem in which we seek a subpartition minimizing “the average” (instead of the maximum) of the edge expansions of the parts (e.g. as in [21]). This objective function is more sensitive and is more likely to produce high-quality clustering results. Nevertheless, the problem unfortunately turns out to be NP-complete on trees even when the search space is restricted to connected subpartitions (or partitions) [9]. Finding a good approximation algorithm for this problem is an interesting and challenging task that can be the purpose of future work in this line of research.
Acknowledgment. We would like to express our sincere thanks to Amir Daneshgar whose valuable comments were crucial in preparing and improving the present article.
References
- [1] N. Alon, Eigenvalues and expanders, Combinatorica 6 (1986), no. 2, 83–96, Theory of computing (Singer Island, Fla., 1984). MR 875835
- [2] N. Alon and V. D. Milman, $\lambda_1$, isoperimetric inequalities for graphs, and superconcentrators, J. Combin. Theory Ser. B 38 (1985), no. 1, 73–88. MR 782626
- [3] Sanjeev Arora, Satish Rao, and Umesh Vazirani, Expander flows, geometric embeddings and graph partitioning, J. ACM 56 (2009), no. 2, Art. 5, 37. MR 2535878
- [4] James C. Bezdek, Pattern recognition with fuzzy objective function algorithms, Plenum Press, New York-London, 1981, With a foreword by L. A. Zadeh, Advanced Applications in Pattern Recognition. MR 631231
- [5] Peter Buser, Geometry and spectra of compact Riemann surfaces, Progress in Mathematics, vol. 106, Birkhäuser Boston, Inc., Boston, MA, 1992. MR 1183224
- [6] Isaac Chavel, Eigenvalues in Riemannian geometry, Pure and Applied Mathematics, vol. 115, Academic Press, Inc., Orlando, FL, 1984, Including a chapter by Burton Randol, With an appendix by Jozef Dodziuk. MR 768584
- [7] Fan R. K. Chung, Spectral graph theory, CBMS Regional Conference Series in Mathematics, vol. 92, Published for the Conference Board of the Mathematical Sciences, Washington, DC; by the American Mathematical Society, Providence, RI, 1997. MR 1421568
- [8] Amir Daneshgar, Hossein Hajiabolhassan, and Ramin Javadi, On the isoperimetric spectrum of graphs and its approximations, J. Combin. Theory Ser. B 100 (2010), no. 4, 390–412. MR 2644242
