The Effectiveness of Uniform Sampling for Center-Based Clustering with Outliers
Hu Ding, Jiawei Huang, Haikuo Yu

TL;DR
This paper introduces a new unified framework analyzing uniform sampling for center-based clustering with outliers, demonstrating its efficiency and effectiveness in reducing complexity while maintaining quality.
Contribution
It presents the first uniform sampling approach that discards exactly z outliers and provides theoretical guarantees based on a significance criterion.
Findings
Sample size can be independent of n/z ratio and dimensionality.
Uniform sampling achieves comparable results to existing methods.
Method significantly reduces running time in experiments.
| | 0.6 | 0.8 | 1 | 2 | 4 | 6 |
|---|---|---|---|---|---|---|
| Algorithm 1 | 0.512 | 0.734 | 0.755 | 0.746 | 0.774 | 0.876 |
| Algorithm 1 | 0.808 | 0.897 | 0.839 | 0.999 | 0.999 | 0.852 |
| Algorithm 1 | 0.874 | 0.845 | 0.973 | 0.999 | 0.999 | 1.000 |
| Algorithm 1 | 0.974 | 0.997 | 0.997 | 0.998 | 0.999 | 0.999 |
| Algorithm 2 | None | None | None | 1.000 | 1.000 | 1.000 |
| MK | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 |
| Malkomes | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 |
| Charikar | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 |
| DYW | 0.999 | 0.997 | 0.997 | 0.998 | 0.998 | 0.997 |
| Algorithm 3 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 |
| Algorithm 3 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 0.999 |
| Algorithm 3 | 1.000 | 1.000 | 1.000 | 1.000 | 0.999 | 0.999 |
| Algorithm 3 | 1.000 | 1.000 | 0.999 | 0.999 | 0.997 | 0.998 |
| Algorithm 4 | None | None | None | 1.000 | 1.000 | 1.000 |
| -means | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 |
| LocalSearch | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 |
| DataSummary | 1.000 | 0.999 | 1.000 | 1.000 | 1.000 | 1.000 |
| Algorithm 3 (-median) | 0.994 | 0.998 | 0.997 | 0.998 | 0.999 | 0.999 |
| Algorithm 3 (-median) | 1.000 | 1.000 | 0.998 | 1.000 | 1.000 | 1.000 |
| Algorithm 3 (-median) | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 |
| Algorithm 3 (-median) | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 |
| Algorithm 4 (-median) | None | None | None | 1.000 | 1.000 | 1.000 |
| -means (-median) | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 |
| LocalSearch (-median) | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 |
| DataSummary (-median) | 0.958 | 0.932 | 0.948 | 0.927 | 0.964 | 0.938 |
| | 0.6 | 0.8 | 1 | 2 | 4 | 6 |
|---|---|---|---|---|---|---|
| Algorithm 1 | 0.746 | 0.804 | 0.722 | 0.977 | 0.896 | 0.978 |
| Algorithm 1 | 0.860 | 0.981 | 0.931 | 1.000 | 0.999 | 0.976 |
| Algorithm 1 | 0.957 | 0.953 | 0.893 | 1.000 | 1.000 | 1.000 |
| Algorithm 1 | 0.972 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 |
| Algorithm 2 | None | None | None | 1.000 | 1.000 | 1.000 |
| MK | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 |
| Malkomes | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 |
| Charikar | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 |
| DYW | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 |
| Algorithm 3 | 0.972 | 0.989 | 0.986 | 0.986 | 0.961 | 0.914 |
| Algorithm 3 | 1.000 | 1.000 | 0.998 | 1.000 | 1.000 | 1.000 |
| Algorithm 3 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 |
| Algorithm 3 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 |
| Algorithm 4 | None | None | None | 1.000 | 1.000 | 1.000 |
| -means | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 |
| LocalSearch | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 |
| DataSummary | 0.958 | 0.932 | 0.948 | 0.927 | 0.964 | 0.938 |
| Algorithm 3 (-median) | 0.997 | 0.994 | 0.999 | 0.998 | 0.996 | 0.994 |
| Algorithm 3 (-median) | 0.996 | 0.999 | 1.000 | 1.000 | 1.000 | 1.000 |
| Algorithm 3 (-median) | 0.995 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 |
| Algorithm 3 (-median) | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 |
| Algorithm 4 (-median) | None | None | None | 1.000 | 1.000 | 1.000 |
| -means (-median) | 0.996 | 0.999 | 1.000 | 1.000 | 1.000 | 1.000 |
| LocalSearch (-median) | 0.999 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 |
| DataSummary (-median) | 0.872 | 0.912 | 0.968 | 0.984 | 0.992 | 0.991 |
| | Shuttle | Kddcup | Covtype | Poker Hand |
|---|---|---|---|---|
| Algorithm 1 | 0.904 | 0.966 | 0.612 | 0.959 |
| Algorithm 1 | 0.906 | 0.964 | 0.695 | 0.978 |
| Algorithm 1 | 0.913 | 0.965 | 0.753 | 0.975 |
| Algorithm 1 | 0.904 | 0.966 | 0.908 | 0.984 |
| Algorithm 2 | 0.903 | 0.957 | 0.513 | 0.965 |
| MK | 0.903 | 0.978 | 0.502 | 0.968 |
| Malkomes | 0.933 | 0.959 | 0.754 | 0.957 |
| DYW | 0.896 | 0.960 | 0.804 | 0.986 |
| Algorithm 3 | 0.886 | 0.946 | 0.823 | 0.993 |
| Algorithm 3 | 0.883 | 0.947 | 0.900 | 0.991 |
| Algorithm 3 | 0.883 | 0.947 | 0.916 | 0.990 |
| Algorithm 3 | 0.886 | 0.948 | 0.897 | 0.991 |
| Algorithm 4 | 0.906 | 0.958 | 0.807 | 0.986 |
| -means | 0.883 | 0.971 | 0.793 | 0.999 |
| LocalSearch | 0.894 | 0.958 | 0.795 | 0.973 |
| DataSummary | 0.889 | 0.950 | 0.764 | 0.988 |
| Algorithm 3 (-median) | 0.889 | 0.946 | 0.728 | 0.995 |
| Algorithm 3 (-median) | 0.886 | 0.946 | 0.802 | 0.990 |
| Algorithm 3 (-median) | 0.881 | 0.946 | 0.802 | 0.990 |
| Algorithm 3 (-median) | 0.887 | 0.949 | 0.862 | 0.989 |
| Algorithm 4 (-median) | 0.917 | 0.966 | 0.777 | 0.994 |
| -means (-median) | 0.882 | 0.966 | 0.733 | 0.999 |
| LocalSearch (-median) | 0.891 | 0.953 | 0.751 | 0.964 |
| DataSummary (-median) | 0.882 | 0.966 | 0.734 | 0.999 |
| | Shuttle | Kddcup | Covtype | Poker Hand |
|---|---|---|---|---|
| Algorithm 1 | 0.810 | 0.690 | 0.491 | 0.502 |
| Algorithm 1 | 0.837 | 0.848 | 0.493 | 0.504 |
| Algorithm 1 | 0.859 | 0.872 | 0.496 | 0.505 |
| Algorithm 1 | 0.913 | 0.904 | 0.504 | 0.505 |
| Algorithm 2 | 0.830 | 0.851 | 0.495 | 0.504 |
| MK | 0.790 | 0.579 | 0.513 | 0.501 |
| Malkomes | 0.789 | 0.632 | 0.498 | 0.500 |
| DYW | 0.790 | 0.580 | 0.492 | 0.501 |
| Algorithm 3 | 0.793 | 0.579 | 0.490 | 0.508 |
| Algorithm 3 | 0.790 | 0.580 | 0.494 | 0.507 |
| Algorithm 3 | 0.803 | 0.582 | 0.504 | 0.505 |
| Algorithm 3 | 0.797 | 0.582 | 0.508 | 0.505 |
| Algorithm 4 | 0.793 | 0.579 | 0.490 | 0.501 |
| -means | 0.818 | 0.579 | 0.491 | 0.501 |
| LocalSearch | 0.790 | 0.579 | 0.498 | 0.511 |
| DataSummary | 0.832 | 0.591 | 0.488 | 0.504 |
| Algorithm 3 (-median) | 0.790 | 0.582 | 0.491 | 0.508 |
| Algorithm 3 (-median) | 0.798 | 0.579 | 0.493 | 0.505 |
| Algorithm 3 (-median) | 0.797 | 0.581 | 0.499 | 0.504 |
| Algorithm 3 (-median) | 0.790 | 0.582 | 0.508 | 0.505 |
| Algorithm 4 (-median) | 0.789 | 0.579 | 0.492 | 0.501 |
| -means (-median) | 0.790 | 0.579 | 0.492 | 0.501 |
| LocalSearch (-median) | 0.811 | 0.627 | 0.496 | 0.510 |
| DataSummary (-median) | 0.829 | 0.602 | 0.489 | 0.502 |
School of Computer Science and Engineering, University of Science and Technology of China, Hefei, China
Abstract
Clustering has many important applications in computer science, but real-world datasets often contain outliers. Moreover, the presence of outliers can make clustering problems much more challenging. To reduce the complexities, various sampling methods have been proposed in past years. Namely, we take a small sample (uniformly or non-uniformly) from the input and run an existing approximation algorithm on the sample. Compared with existing non-uniform sampling methods, the uniform sampling approach has several significant benefits. For example, it only needs to read the data in one pass and is very easy to implement in practice. Thus, the effectiveness of uniform sampling for clustering with outliers is a natural and fundamental problem that deserves study in both theory and practice. The previous analyses of uniform sampling often indicate that the sample size should depend on the ratio $n/z$, where $n$ is the number of input points and $z$ is the pre-specified number of outliers, and on the dimensionality (for instances in Euclidean space), both of which could be very high. Moreover, to guarantee the desired clustering qualities, they need to discard more than $z$ outliers. In this paper, we propose a new and unified framework for analyzing the effectiveness of uniform sampling for three representative center-based clustering with outliers problems: $k$-center/median/means clustering with outliers. We introduce a “significance” criterion and prove that the performance of uniform sampling depends on the significance degree of the given instance. In particular, we show that the sample size can be independent of the ratio $n/z$ and the dimensionality. More importantly, to the best of our knowledge, our method is the first uniform sampling approach that allows discarding exactly $z$ outliers for these three center-based clustering with outliers problems.
The results proposed in this paper can also be viewed as an extension of the previous sub-linear time algorithms for the ordinary clustering problems (without outliers). The experiments suggest that the uniform sampling method can achieve clustering results comparable to those of other existing methods while greatly reducing the running times.
1 Introduction
Clustering is a fundamental topic with many important applications in the real world [30]. An important class of clustering problems is “center-based clustering”, which includes the well-known $k$-center/median/means clustering problems [3]. Center-based clustering problems can be defined in arbitrary metrics and in Euclidean space $\mathbb{R}^D$. Usually, a center-based clustering problem aims to find $k$ cluster centers so as to minimize the induced clustering cost. For example, the $k$-center clustering problem is to minimize the maximum distance from the input data to the set of cluster centers [26, 22]; the $k$-median (means) clustering problem is to minimize the average (squared) distance instead [35, 34].
Real-world datasets often contain outliers that can seriously degrade the final clustering results [43, 8]. Clustering with outliers can be viewed as a generalization of the ordinary clustering problems; however, the presence of outliers makes the problems much more challenging. Charikar et al. [9] proposed a $3$-approximation algorithm for $k$-center clustering with outliers in arbitrary metrics. The time complexity of their algorithm is at least quadratic in the data size, since it needs to read all the pairwise distances. A subsequent streaming $(4+\epsilon)$-approximation algorithm was proposed by McCutchen and Khuller [38]. Chakrabarty et al. [7] showed a $2$-approximation algorithm for metric $k$-center clustering with outliers based on LP relaxation techniques. Recently, Ding et al. [16] provided a greedy algorithm that yields a bi-criteria approximation (returning more than $k$ clusters) based on the idea of Gonzalez's $k$-center clustering algorithm [22]. Bădoiu et al. [4] showed a coreset-based approach, but it has an exponential time complexity if $k$ is not a constant (a “coreset” is a small set of points that approximates the structure/shape of a much larger point set, and thus can be used to significantly reduce the time complexities for many optimization problems [18]). The coresets for instances in doubling metrics were studied in [16, 6, 27].
For $k$-median/means clustering with outliers, the algorithms with provable guarantees [13, 32, 21] are difficult to implement due to their high complexities. Several heuristic algorithms without provable guarantees have been studied before [11, 42]. By using the local search method, Gupta et al. [25] provided a $274$-approximation algorithm for $k$-means clustering with outliers; they also showed that the well-known $k$-means++ method [2] can be used as a coreset approach to reduce the complexity. Very recently, Im et al. [28] provided a method for constructing the coreset of $k$-means clustering with outliers by combining $k$-means++ and uniform sampling. Partly inspired by the successive sampling method of [39], Chen et al. [12] proposed a novel summary construction algorithm to reduce the input data size.
Moreover, due to the rapid increase of data volumes in the real world, a number of communication-efficient distributed algorithms for $k$-center clustering with outliers [36, 23, 6, 33] and $k$-median/means clustering with outliers [23, 33, 12] have been proposed in recent years.
1.1 Existing Sampling Methods and Our Main Results
As mentioned above, existing algorithms for clustering with outliers often have high complexities (e.g., quadratic complexity). Therefore, several sampling methods have been studied for reducing the complexities. Namely, we take a small sample (uniformly or non-uniformly) from the input and run an existing approximation algorithm on the sample. The non-uniform sampling methods include the aforementioned greedy algorithm [16], $k$-means++ [25, 28], and successive sampling [12]. However, these approaches suffer from several drawbacks in practice; for example, they need to read the input dataset in multiple passes with high computational complexities, or have to discard more than the pre-specified number of outliers. The sensitivity-based coreset method is also a popular non-uniform sampling approach for ordinary clustering problems [19]. Informally, each data point has a “sensitivity” that measures its importance to the whole dataset; the coreset construction is a simple sampling procedure where each point is drawn i.i.d. with probability proportional to its sensitivity. However, to the best of our knowledge, the sensitivity-based coreset approach is not well suited to handling outliers: the sensitivities are hard to compute because each point could be an inlier or an outlier for different solutions.
Due to its simplicity, the idea of uniform sampling has attracted a lot of attention. We follow the usual definition of “uniform sampling” in [40, 41, 14], where it means that we take a sample from the input independently and uniformly at random. Suppose $n$ is the input size and $z$ is the number of outliers. Charikar et al. [10] and Meyerson et al. [40] respectively provided uniform sampling approaches for reducing the data size for clustering with outliers; Huang et al. [27] and Ding et al. [16] presented similar results for instances in Euclidean space. However, these methods usually suffer from the following dilemma.
The dilemma for uniform sampling. Let $S$ be the uniform sample from the input. If we try to avoid sampling any outlier, the size of $S$ should not be too large; but a small $S$ is more likely to yield a large clustering error. If we keep $|S|$ large enough, it is necessary to have an accurate enough estimation of the number of sampled outliers in $S$; that means $|S|$ should be larger than some threshold depending on $\frac{n}{z}$ (and the dimensionality $D$ for instances in Euclidean space), which could be very large (e.g., $D$ could be high and $z$ could be much smaller than $n$). As shown by the results in [27, 16], $S$ should be at least an $\epsilon$-sample (or some other variant) with the size depending on the VC-dimension in Euclidean space and the ratio $\frac{n}{z}$. Moreover, since it is impossible to know the exact number of sampled outliers in $S$ (even if $|S|$ is large), the error on the number of discarded outliers seems to be inevitable in almost all the previous uniform sampling approaches [10, 40, 27, 16]; that is, to guarantee the desired clustering qualities, they need to discard more than $z$ outliers.
The uniform sampling method was also applied to design sub-linear time algorithms for the ordinary $k$-median/means clustering (without outliers) problems [40, 41, 14, 29], and we notice that the sample sizes proposed in [40, 14] are independent of $n$ and $D$. Therefore, a natural question is whether their results can be extended to the clustering with outliers problems; in other words, is it possible to remove the dependencies on the values $\frac{n}{z}$ and $D$ in the sample complexity of uniform sampling? Another key question is whether we can discard exactly $z$ outliers when using uniform sampling.
Our contributions. Though the uniform sampling approach suffers from the above dilemma in theory, it often achieves good performance in practice even if the sample size is much smaller than the theoretical bounds proposed in [10, 40, 27, 16]. To explain this phenomenon, we propose a new and unified framework for analyzing the effectiveness of uniform sampling for $k$-center/median/means clustering with outliers. We show that the sample size can be independent of the ratio $\frac{n}{z}$ and the dimensionality $D$, under some reasonable assumption. If we only need to output the cluster centers, our uniform sampling approach runs in sub-linear time that is independent of the input size $n$ (obviously, if we need to output the clustering membership of each point, at least linear time is required). To further boost the success probability, we can take multiple samples and select the one yielding the smallest objective value by scanning the whole dataset in one pass. More importantly, to the best of our knowledge, our method is the first uniform sampling approach that allows discarding exactly $z$ outliers.
In our framework, we consider the relation between two values: the lower bound of the sizes of the optimal clusters and the number of outliers $z$ (the formal definitions will be shown in Section 1.2). If a cluster has size $\ll z$, then we can say that it is not a “significant” cluster. In real applications, we may only have an estimation for the number of clusters $k$. Consequently, if there exists a cluster having size $\ll z$, we can formulate the problem as a simpler $(k-1)$-center/median/means clustering with outliers instead. So we can assume that each cluster has a size comparable to $z$. If the ratio of the cluster-size lower bound to $z$ is $\Omega(1)$ (note that the ratio could be a value smaller than $1$ in this case), our framework outputs $k+O(\log k)$ cluster centers that yield a $4$-approximation for $k$-center clustering with outliers; further, if the ratio is larger than $1$, our framework returns exactly $k$ cluster centers that yield a $(c+2)$-approximation solution, if we run an existing $c$-approximation algorithm with $c\geq 1$ on the sample. The framework can also handle $k$-median/means clustering with outliers and yields similar results.
We should point out that Meyerson et al. [40] also considered the lower bound of the cluster sizes when designing their sub-linear time $k$-median clustering (without outliers) algorithm. However, it is challenging to directly extend their idea to handle the case with outliers (as explained in the aforementioned dilemma for uniform sampling), and thus we need to develop significantly new ideas in our algorithm design and analysis. Recently, Gupta [24] proposed a similar uniform sampling approach to handle $k$-means clustering with outliers. However, the analysis and results in [24] are quite different from ours. Also, the assumption in [24] is stronger: it requires that each optimal cluster has size roughly (i.e., ) where is a small parameter in .
1.2 Preliminaries
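Let the input be a point set $P\subset\mathbb{R}^D$ with $|P|=n$. Given a set of points $H\subset\mathbb{R}^D$ and a positive integer $z < n$, we define the following notations.
$$\Delta^{-z}_{\infty}(P,H)=\min\Big\{\max_{p\in P'}dist(p,H)\ \Big|\ P'\subset P,\ |P'|=n-z\Big\},$$
$$\Delta^{-z}_{1}(P,H)=\min\Big\{\frac{1}{|P'|}\sum_{p\in P'}dist(p,H)\ \Big|\ P'\subset P,\ |P'|=n-z\Big\},$$
$$\Delta^{-z}_{2}(P,H)=\min\Big\{\frac{1}{|P'|}\sum_{p\in P'}\big(dist(p,H)\big)^{2}\ \Big|\ P'\subset P,\ |P'|=n-z\Big\},$$
where $dist(p,H)=\min_{q\in H}\|p-q\|$ and $\|p-q\|$ denotes the Euclidean distance between $p$ and $q$.
Definition 1 ($k$-Center/Median/Means Clustering with Outliers)
Given a set $P$ of $n$ points in $\mathbb{R}^D$ with two positive integers $k$ and $z < n$, the problem of $k$-center (resp., $k$-median, $k$-means) clustering with outliers is to find $k$ cluster centers $H=\{h_1,\dots,h_k\}\subset\mathbb{R}^D$, such that the objective function $\Delta^{-z}_{\infty}(P,H)$ (resp., $\Delta^{-z}_{1}(P,H)$, $\Delta^{-z}_{2}(P,H)$) is minimized.
The definition can be easily modified for an arbitrary metric space $(X,d)$, where $X$ contains $n$ vertices and $d(\cdot,\cdot)$ is the distance function: the Euclidean distance “$\|p-q\|$” is replaced by $d(p,q)$; the cluster centers should be chosen from $X$. In this paper, we always use $P_{opt}$, a subset of $P$ with size $n-z$, to denote the subset yielding the optimal solution with respect to the objective functions in Definition 1. Also, let $\{C^*_1,\dots,C^*_k\}$ be the $k$ optimal clusters forming $P_{opt}$.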
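As a sanity check on the three objectives in Definition 1, the sketch below evaluates them for a fixed set of centers. It assumes the usual reading in which the $z$ points farthest from the centers are discarded and the median/means costs are averaged over the remaining $n-z$ points; the function names are illustrative and do not come from the paper.

```python
import math

def dist_to_centers(p, centers):
    """Distance from point p to its nearest center."""
    return min(math.dist(p, c) for c in centers)

def objective_with_outliers(points, centers, z, kind):
    """Cost of `centers` on `points` after discarding the z farthest points.

    kind: "center" (max distance), "median" (average distance),
          "means" (average squared distance).
    """
    if not 0 <= z < len(points):
        raise ValueError("z must satisfy 0 <= z < len(points)")
    dists = sorted(dist_to_centers(p, centers) for p in points)
    kept = dists[:len(dists) - z]  # the inliers for this fixed set of centers
    if kind == "center":
        return kept[-1]
    if kind == "median":
        return sum(kept) / len(kept)
    if kind == "means":
        return sum(d * d for d in kept) / len(kept)
    raise ValueError(f"unknown kind: {kind}")

# Three points near the origin plus one far-away outlier.
points = [(0.0, 0.0), (1.0, 0.0), (0.0, 2.0), (100.0, 100.0)]
centers = [(0.0, 0.0)]
print(objective_with_outliers(points, centers, z=1, kind="center"))  # 2.0
```

For a fixed set of centers, dropping the $z$ farthest points is the best choice of discarded set for all three objectives, which is why a single sort suffices here.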
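As mentioned in Section 1.1, it is reasonable to assume that $\min_{1\leq j\leq k}|C^*_j|$ is not far smaller than $z$. To formally state this assumption, we introduce the following definition.
Definition 2 ($(\epsilon_1,\epsilon_2)$-Significant Instance)
Let $\epsilon_1,\epsilon_2>0$. Given an instance $P$ of $k$-center (resp., $k$-median, $k$-means) clustering with outliers as described in Definition 1, if $\min_{1\leq j\leq k}|C^*_j|\geq\frac{\epsilon_1}{k}n$ and $z=\frac{\epsilon_2}{k}n$, we say that it is an $(\epsilon_1,\epsilon_2)$-significant instance.
Obviously, since $\min_{1\leq j\leq k}|C^*_j|\leq\frac{n-z}{k}$, $\epsilon_1$ should be smaller than $1$. In Definition 2, we do not say “$\min_{1\leq j\leq k}|C^*_j|=\frac{\epsilon_1}{k}n$”, since we may not be able to obtain the exact value of $\min_{1\leq j\leq k}|C^*_j|$; instead, we may only have a lower bound of it. The ratio $\frac{\epsilon_1}{\epsilon_2}$ ($\leq\min_{1\leq j\leq k}|C^*_j|/z$) reveals the “significance” of the clusters relative to the outliers; the higher the ratio is, the more significant the clusters are.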
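To make the significance parameters concrete, the following sketch computes the largest admissible $\epsilon_1$ and the value of $\epsilon_2$ for a toy instance. It assumes that Definition 2 reads $\min_j|C^*_j|\geq\frac{\epsilon_1}{k}n$ and $z=\frac{\epsilon_2}{k}n$, which is consistent with the bound stated in Lemma 2; the helper name `significance` is hypothetical. In practice the optimal cluster sizes are unknown, so this is only an illustration.

```python
def significance(cluster_sizes, z):
    """Return (eps1, eps2) for an instance with the given optimal cluster sizes.

    Assumed reading of Definition 2 (not verbatim from the paper):
      min_j |C_j| >= (eps1 / k) * n   and   z = (eps2 / k) * n.
    eps1 is reported as the largest value satisfying the first inequality.
    """
    n = sum(cluster_sizes) + z      # inliers plus outliers
    k = len(cluster_sizes)
    eps1 = k * min(cluster_sizes) / n
    eps2 = k * z / n
    return eps1, eps2

# Three clusters of sizes 400, 300, 200 and 100 outliers (n = 1000).
eps1, eps2 = significance([400, 300, 200], z=100)
print(eps1, eps2, eps1 / eps2)
```

Here the ratio $\epsilon_1/\epsilon_2$ equals the smallest cluster size divided by $z$, so a ratio above 1 simply says that every cluster is larger than the set of outliers.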
The rest of the paper is organized as follows. To help our analysis, we present two implications of Definition 2 in Section 2. Then we introduce our uniform sampling algorithms for $k$-center clustering with outliers and $k$-median/means clustering with outliers in Sections 3 and 4, respectively. We also explain how to boost the success probability of our framework and how to further determine the clustering memberships of data points in Section 5. Finally, we present our experimental results in Section 7.
2 Implications of Significant Instance
Lemma 1
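Given an $(\epsilon_1,\epsilon_2)$-significant instance $P$ as described in Definition 2, one selects a set $S$ of points from $P$ uniformly at random. Let $\eta,\delta\in(0,1)$. (i) If $|S|\geq\frac{k}{\epsilon_1}\log\frac{k}{\eta}$, with probability at least $1-\eta$, $S\cap C^*_j\neq\emptyset$ for any $1\leq j\leq k$. (ii) If $|S|\geq\frac{3k}{\delta^{2}\epsilon_1}\log\frac{2k}{\eta}$, with probability at least $1-\eta$, $\big|S\cap C^*_j\big|\in(1\pm\delta)\frac{|C^*_j|}{n}|S|$ for any $1\leq j\leq k$.
Lemma 1 can be obtained by using the Chernoff bound [1], and we leave the proof to Section 9. Moreover, we know that the expected number of outliers contained in the sample $S$ is $\frac{\epsilon_2}{k}|S|$. So we immediately have the following result by using Markov's inequality.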
Lemma 2
Given an $(\epsilon_1,\epsilon_2)$-significant instance $P$ as described in Definition 2, one selects a set $S$ of points from $P$ uniformly at random. Let $\eta\in(0,1)$. With probability at least $1-\eta$, $\big|S\setminus P_{opt}\big|\leq\frac{\epsilon_{2}}{k\eta}|S|$.
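The Markov step behind Lemma 2 can be checked numerically. The sketch below assumes sampling with replacement and uses the fact that the outlier fraction is $z/n$ (equal to $\epsilon_2/k$ under the reading $z=\frac{\epsilon_2}{k}n$); the function names and the simulation parameters are illustrative only.

```python
import random

def outlier_bound(sample_size, z, n, eta):
    """Lemma 2 threshold: (z / n) * |S| / eta, i.e. eps2 / (k * eta) * |S|."""
    return (z / n) * sample_size / eta

def empirical_failure_rate(n, z, sample_size, eta, trials=500, seed=0):
    """Fraction of trials in which the sample holds more outliers than the bound."""
    rng = random.Random(seed)
    bound = outlier_bound(sample_size, z, n, eta)
    failures = 0
    for _ in range(trials):
        # Indices 0..z-1 stand for outliers; draw |S| indices with replacement.
        outliers = sum(1 for _ in range(sample_size) if rng.randrange(n) < z)
        if outliers > bound:
            failures += 1
    return failures / trials

# n = 10000 points, z = 500 outliers, |S| = 100, eta = 0.2.
print(outlier_bound(100, 500, 10000, 0.2))
print(empirical_failure_rate(10000, 500, 100, 0.2))
```

Markov's inequality only guarantees a failure rate of at most $\eta$; the observed rate is typically far smaller, which matches the remark that uniform sampling often behaves better in practice than the worst-case bounds suggest.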
3 Uniform Sampling for $k$-Center Clustering with Outliers
Let $r_{opt}$ be the optimal radius of the instance $P$, i.e., each optimal cluster $C^*_j$ is covered by a ball with radius $r_{opt}$. For any point $p\in\mathbb{R}^D$ and any value $r\geq 0$, we use $Ball(p,r)$ to denote the ball centered at $p$ with radius $r$. We present Algorithms 1 and 2 for the $k$-center clustering with outliers problem, and prove their clustering qualities in Theorems 3.1 and 3.2, respectively. We only focus on the problem in Euclidean space due to the space limit, but the results also hold for abstract metric spaces by using the same idea.
Theorem 3.1
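In Algorithm 1, the size $|H|=k+k'$, where $k'=\frac{\epsilon_2}{k\eta}|S|$ is the bound on the number of sampled outliers from Lemma 2. Also, with probability at least $(1-\eta)^2$, $\Delta^{-z}_{\infty}(P,H)\leq 4r_{opt}$.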
Remark 1
(i) If we assume both and are , Theorem 3.1 indicates that Algorithm 1 returns cluster centers. means that , that is, the number of outliers is not significantly larger than the size of the smallest cluster. The running time of the subroutine algorithm [22] in Step 2 is , which is independent of the input size . (ii) In Section 6, we present an example to show that the value of cannot be further reduced. That is, the clustering quality could be arbitrarily bad if we run -center clustering on with .
Proof
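(of Theorem 3.1) First, it is straightforward to know that $|H|=k+k'$. Below, we assume that the sample $S$ contains at least one point from each $C^*_j$, and at most $k'$ points from $P\setminus P_{opt}$ (these events happen with probability at least $(1-\eta)^2$ due to Lemmas 1 and 2).
Since the sample $S$ contains at most $k'$ points from $P\setminus P_{opt}$ and $S\cap P_{opt}$ can be covered by $k$ balls with radius $r_{opt}$, we know that $S$ can be covered by $k+k'$ balls with radius $r_{opt}$. Thus, if we perform the $2$-approximation $(k+k')$-center clustering algorithm [22] on $S$, the resulting balls should have radius no larger than $2r_{opt}$. Let $H=\{h_1,\dots,h_{k+k'}\}$ and let $\{Ball(h_l,r)\mid 1\leq l\leq k+k'\}$ be those balls covering $S$ with $r\leq 2r_{opt}$. Also, for each $1\leq j\leq k$, since $S\cap C^*_j\neq\emptyset$, there exists one of these balls, say $Ball(h_{l_j},r)$, that covers at least one point, say $p_j$, from $C^*_j$. For any point $p\in C^*_j$, we have $\|p-p_j\|\leq 2r_{opt}$ (by the triangle inequality) and $\|p_j-h_{l_j}\|\leq r\leq 2r_{opt}$; therefore, $\|p-h_{l_j}\|\leq\|p-p_j\|+\|p_j-h_{l_j}\|\leq 4r_{opt}$. See Figure 1. Overall, $P_{opt}$ is covered by the balls $\cup^{k+k'}_{l=1}Ball(h_l,4r_{opt})$, i.e., $\Delta^{-z}_{\infty}(P,H)\leq 4r_{opt}$. ∎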
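The argument above suggests the following sketch of a sample-then-cluster procedure of the kind analyzed in Theorem 3.1. Algorithm 1 itself is not reproduced in this text, so this is a reconstruction from the proof rather than the authors' pseudocode: it draws a uniform sample with replacement and runs the farthest-point heuristic of [22] with `k + k_extra` centers, where the hypothetical parameter `k_extra` plays the role of the allowance for sampled outliers.

```python
import math
import random

def gonzalez(points, m, rng):
    """Farthest-point traversal: a 2-approximation for m-center (no outliers)."""
    centers = [rng.choice(points)]
    nearest = [math.dist(p, centers[0]) for p in points]
    while len(centers) < min(m, len(points)):
        # Pick the point currently farthest from all chosen centers.
        far = max(range(len(points)), key=nearest.__getitem__)
        centers.append(points[far])
        nearest = [min(d, math.dist(p, points[far]))
                   for p, d in zip(points, nearest)]
    return centers

def sample_then_cluster(points, k, k_extra, sample_size, seed=0):
    """Uniformly sample with replacement, then cluster the sample only."""
    rng = random.Random(seed)
    sample = [rng.choice(points) for _ in range(sample_size)]
    return gonzalez(sample, k + k_extra, rng)

# Two well-separated clusters of 500 points each, plus 10 far-away outliers.
inliers = ([(0.01 * i, 0.0) for i in range(500)]
           + [(100.0 + 0.01 * i, 0.0) for i in range(500)])
outliers = [(1000.0 * (j + 1), 1000.0) for j in range(10)]
centers = sample_then_cluster(inliers + outliers, k=2, k_extra=8, sample_size=60)
print(len(centers))
```

Only the sample is clustered, so the cost of this step depends on the sample size and the number of centers rather than on $n$; evaluating the resulting centers on the full dataset still requires one pass over the data.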
Theorem 3.2
If , with probability at least , Algorithm 2 returns cluster centers achieving a -approximation for -center clustering with outliers, i.e., .
Remark 2
For example, if we set , the algorithm works for any instance with . Actually, as long as (i.e., ), we can always find the appropriate values for and to satisfy . For example, we can set and ; obviously, if is close to , the success probability could be small (we will show how to boost the success probability in Section 5). The running time depends on the complexity of the subroutine $c$-approximation algorithm used in Step 2. For example, the algorithm of [9] takes $O\big(|S|^{2}D+k|S|^{2}\log|S|\big)$ time in $\mathbb{R}^D$.
Proof
(of Theorem 3.2) We assume that for each , and has at most points from (these events happen with probability at least due to Lemma 1 and 2).
Let be the set of balls returned in Step 2 of Algorithm 2. Since can be covered by balls with radius and , the optimal radius for the instance with outliers should be at most . Consequently, . Moreover,
[TABLE]
for any , where the last inequality comes from . Thus, if we perform -center clustering with outliers on , the resulting balls must cover at least one point from each (since from (4)). Through a similar manner as the proof of Theorem 3.1, we know that is covered by the balls , i.e., . ∎
4 Uniform Sampling for $k$-Median/Means Clustering with Outliers
For the problem of $k$-means clustering with outliers, we apply ideas similar to those of Algorithms 1 and 2 (see Algorithms 3 and 4). However, the analyses are more complicated here. For ease of understanding, we present our high-level idea first. Also, due to the space limit, we show the extensions for $k$-median clustering with outliers and their counterparts in arbitrary metric spaces in Section 10.
High-level idea. Let $S$ be a large enough random sample from $P$. Denote by $O^*=\{o^*_1,\dots,o^*_k\}$ the mean points of $\{C^*_1,\dots,C^*_k\}$, respectively. We first show that $S\cap C^*_j$ can well approximate $C^*_j$ for each $1\leq j\leq k$. Informally,
[TABLE]
We also define a transformation on to help our analysis.
Definition 3 (Star Shaped Transformation)
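For each point in $C^*_j$, we translate it to $o^*_j$; overall, we generate a new set of $n-z$ points located at $\{o^*_1,\dots,o^*_k\}$, where each $o^*_j$ has $|C^*_j|$ overlapping points. For any point $p\in C^*_j$ with $1\leq j\leq k$, denote by $\tilde{p}$ its transformed point; for any $U\subseteq P_{opt}$, denote by $\tilde{U}$ its transformed point set.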
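The transformation in Definition 3 can be sketched in a few lines. The snippet assumes the optimal clusters are given explicitly as lists of points (which is only the case in the analysis, not in the algorithms) and uses a hypothetical helper name `star_transform`.

```python
def star_transform(clusters):
    """Move every point of each cluster onto that cluster's mean point.

    `clusters` is a list of clusters, each a non-empty list of equal-length
    tuples. The result keeps the cluster sizes, so each mean carries as many
    overlapping points as its cluster had.
    """
    transformed = []
    for cluster in clusters:
        dim = len(cluster[0])
        mean = tuple(sum(p[i] for p in cluster) / len(cluster)
                     for i in range(dim))
        transformed.append([mean] * len(cluster))
    return transformed

clusters = [[(0.0, 0.0), (2.0, 0.0)], [(10.0, 10.0)]]
print(star_transform(clusters))
# [[(1.0, 0.0), (1.0, 0.0)], [(10.0, 10.0)]]
```

After the transformation, the cost of any set of centers depends only on the $k$ mean points and the cluster sizes, which is what makes the transformed sets convenient intermediaries in the analysis.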
Since the transformation forms “stars” (see Figure. 2), we call it “star shaped transformation”. Let . By using (5), we can prove that the clustering costs of and are close (after the normalization) for any given set of cluster centers. Let be the cluster centers returned by Algorithm 3. Then, we can use and as the “bridges” between and , so as to prove that yields an approximate solution for the instance , i.e., is bounded.
In Algorithm 4, we run -means with outliers algorithm on the sample and return (rather than ) cluster centers. So we need to modify the above idea to analyze the quality. Let be the set of inliers of obtained in Step 2. If the ratio is large enough, we can prove that for each . Therefore, we can replace “” by “” in (5) and prove a similar quality guarantee for Algorithm 4.
In Theorems 4.1 and 4.2, $\mathcal{L}$ denotes the maximum diameter of the clusters $C^*_1,\dots,C^*_k$, i.e., $\mathcal{L}=\max_{1\leq j\leq k}\max_{p,q\in C^{*}_{j}}\|p-q\|$. Actually, our result can be viewed as an extension of the sub-linear time $k$-median/means clustering algorithms [40, 41, 14] to the case with outliers. We also want to emphasize that the additive error is unavoidable even for the vanilla case (without outliers), if we require the sample complexity to be independent of the input size [41, 14]; the $k$-median clustering with outliers algorithm proposed by Meyerson et al. [40] does not yield an additive error, but it needs to discard more than $z$ outliers and its sample size depends on the ratio $\frac{n}{z}$. We place the full proofs of Theorems 4.1 and 4.2 in Sections 4.1 and 4.2, respectively.
Theorem 4.1
With probability at least , the set of cluster centers returned by Algorithm 3 results in a clustering cost at most , where $\alpha=\big(2+(4+4c)\frac{1+\delta}{1-\delta}\big)$ and .
Remark 3
(i) In Step 2 of Algorithm 3, we can apply an -approximation -means algorithm (e.g., [31]). If we assume and are fixed constants, then the sample size , and both the factors and are , i.e., ; moreover, the number of returned cluster centers if . (ii) Similar to Theorem 3.1, we use the same example in Section 6 to show that the value of cannot be further reduced in Algorithm 3.
Theorem 4.2
Assume . With probability at least , the set of cluster centers returned by Algorithm 4 results in a clustering cost at most , where $\alpha=\big(2+(4+4c)\frac{t}{t-1}\frac{1+\delta}{1-\delta}\big)$ and .
Similar to Theorem 3.2, as long as , we can set and to keep .
4.1 Proof of Theorem 4.1
The following lemma can be obtained via the Hoeffding’s inequality (each can be viewed as a random variable between [math] and ) [1].
Lemma 3
We fix a cluster . Given , if one uniformly selects a set of or more points at random from ,
[TABLE]
with probability at least .
Lemma 4
If one uniformly selects a set of points at random from ,
[TABLE]
for , with probability at least .
Proof
Suppose . According to Lemma 1, indicates that for each ; further, implies that . Therefore, and by Lemma 3 we have
[TABLE]
with probability at least for each ( is replaced by in Lemma 3 for taking the union bound). From (8) we know that
[TABLE]
where the second inequality comes from Lemma 1. So we complete the proof. ∎
We define a new notation that is used in the following lemmas. Given two point sets and , we use to denote the clustering cost of by taking as the cluster centers, i.e., . Obviously, . Let . Below, we prove the upper bounds of , , and respectively, and use these bounds to complete the proof of Theorem 4.1. For convenience, we always assume that the events mentioned in Lemma 1, 2, and 4 all happen so that we do not need to repeatedly state the success probabilities.
Lemma 5
$Cost(S_{opt},O^{*})\leq(1+\delta)\frac{|S|}{n}(n-z)\big(\Delta^{-z}_{2}(P,O^{*})+\xi\mathcal{L}^{2}\big)$.
Proof
First, we have
[TABLE]
by Lemma 4. Further, since and , we know that is no larger than $(1+\delta)\frac{|S|}{n}(n-z)\big(\Delta^{-z}_{2}(P,O^{*})+\xi\mathcal{L}^{2}\big)$. ∎
Lemma 6
.
Proof
We fix a point , and assume that the nearest neighbors of and in are and , respectively. Then, we have
[TABLE]
via the triangle inequality. Therefore,
[TABLE]
Moreover, since (because ) and yields a -approximate clustering cost of the -means clustering on , we have
[TABLE]
where is the optimal clustering cost of -means clustering on . Let be the farthest points of to , then the set also forms a solution for -means clustering on ; namely, is partitioned into clusters where each point of is a cluster having a single point. Obviously, such a clustering yields a clustering cost . Consequently,
[TABLE]
Also, Lemma 2 shows that contains at most points from , i.e., . Thus, . Together with (12), (13), and (14), we have . ∎
Lemma 7
.
Proof
From the constructions of $\tilde{P}_{opt}$ and $\tilde{S}_{opt}$, we know that they are overlapping points located at $\{o^*_1,\dots,o^*_k\}$. From Lemma 1, we know $|S\cap C^*_j|\geq(1-\delta)\frac{|C^*_j|}{n}|S|$, i.e., $|C^*_j|\leq\frac{1}{1-\delta}\frac{n}{|S|}|S\cap C^*_j|$ for $1\leq j\leq k$. Overall, we have $Cost(\tilde{P}_{opt},H)=\sum^{k}_{j=1}|C^{*}_{j}|\big(dist(o^{*}_{j},H)\big)^{2}$, which is at most $\frac{1}{1-\delta}\frac{n}{|S|}\sum^{k}_{j=1}|S\cap C^{*}_{j}|\big(dist(o^{*}_{j},H)\big)^{2}=\frac{1}{1-\delta}\frac{n}{|S|}Cost(\tilde{S}_{opt},H)$. ∎
Now, we are ready to prove Theorem 4.1. Note that actually is the -means clustering cost of by removing the farthest points to , and . So we have . Further, in a manner similar to (12), we have . Therefore,
[TABLE]
where (15), (16), and (17) come from Lemmas 7, 6, and 5, respectively, and (18) comes from the fact . The success probability comes from Lemma 4 and Lemma 2 (note that Lemma 4 already takes into account the success probability of Lemma 1). Thus, we obtain Theorem 4.1.
4.2 Proof of Theorem 4.2
Suppose the clusters of obtained in Step 2 of Algorithm 4 are , and thus the inliers . In the following lemmas, we always assume that the events mentioned in Lemmas 1, 2, and 4 all happen, so that we do not need to repeatedly state the success probabilities.
Lemma 8
* for each .*
Proof
Since and for each (by Lemma 1), we have
[TABLE]
where the last inequality comes from . Thus . ∎
Lemma 9
Cost(S_{in}\cap P_{opt},H)\leq(1+\delta)\frac{|S|}{n}\cdot c\cdot\big(Cost(P_{opt},O^{*})+(n-z)\cdot\xi\mathcal{L}^{2}\big).
Proof
Since , we have
[TABLE]
Because is a -approximation on ,
[TABLE]
where the last inequality comes from (20). Therefore,
[TABLE]
where the second and third inequalities come from (21) and Lemma 5, respectively. ∎
Since , we immediately have the following lemma via Lemma 5.
Lemma 10
Cost(S_{in}\cap P_{opt},O^{*})\leq(1+\delta)\frac{|S|}{n}(n-z)\big(\Delta^{-z}_{2}(P,O^{*})+\xi\mathcal{L}^{2}\big).
For convenience, let . In the same manner as (12), we have
[TABLE]
Also, because and , we have
[TABLE]
where the last inequality comes from Lemma 8. From (23), (24), and (25), we have
[TABLE]
Combining Lemmas 9 and 10, we can obtain Theorem 4.2 from (26) by simple calculation.
5 Success Probability and Clustering Memberships
The parameter determines the success probabilities of our algorithms. In particular, as mentioned in Remark 2, we cannot set too small to guarantee “” in Algorithm 2 (and similarly in Algorithm 4). To satisfy this requirement, we need to set large enough, and therefore the success probability could be low. In fact, we can run the algorithm multiple times to achieve a higher success probability; for example, if and we run the algorithm times, the success probability will be . Suppose we run the algorithm (Algorithm 2 or 4) times and let be the set of output candidates. The remaining issue is how to select the candidate achieving the smallest objective value.
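The amplification argument above is the standard bound for independent repetitions: if a single run succeeds with probability at least p, then at least one of m independent runs succeeds with probability at least 1 - (1 - p)^m. The following minimal Python sketch illustrates the arithmetic; the names `p`, `m`, `target`, and both helper functions are illustrative assumptions, not notation from the paper.

```python
import math


def boosted_success_probability(p: float, m: int) -> float:
    """Probability that at least one of m independent runs succeeds,
    when each single run succeeds with probability p."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must lie in [0, 1]")
    if m < 0:
        raise ValueError("m must be non-negative")
    return 1.0 - (1.0 - p) ** m


def runs_needed(p: float, target: float) -> int:
    """Smallest m such that 1 - (1 - p)^m >= target, for 0 < p < 1 and 0 < target < 1."""
    if not (0.0 < p < 1.0 and 0.0 < target < 1.0):
        raise ValueError("p and target must lie strictly between 0 and 1")
    m = max(1, math.ceil(math.log(1.0 - target) / math.log(1.0 - p)))
    # Guard against floating-point rounding at the boundary.
    while boosted_success_probability(p, m) < target:
        m += 1
    return m
```

For instance, `runs_needed(0.5, 0.9)` asks how many repetitions of a run that succeeds half of the time are needed to reach an overall success probability of at least 0.9.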
A simple way is to scan the whole dataset in one pass. When reading a point from , we calculate its distance to all the candidates, i.e., ; after scanning the whole dataset, we have calculated the clustering costs (resp., and ) for and return the best one. Moreover, a by-product of this procedure is that we can determine the clustering memberships of the data points simultaneously. When calculating for , we record the index of its nearest cluster center in ; finally, we return the corresponding clustering memberships after selecting the best candidate.
We are aware of the sampling method proposed by Meyerson et al. [40] for estimating the k-median clustering cost, but it induces an error on the number of outliers for our clustering with outliers problems. As mentioned in Section 1.1, the sampling-based ideas in [10, 27, 16] have the same issue.
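The one-pass selection described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: points are coordinate tuples in Euclidean space, each candidate is a list of centers, the z farthest points are discarded as outliers, and the function name `select_best_candidate` and its parameters are hypothetical. To discard exactly z points without storing all distances, each candidate keeps a min-heap of its z + 1 largest distances.

```python
import heapq
import math


def select_best_candidate(stream, candidates, z, objective="means"):
    """Scan `stream` once and return (best_index, best_cost, memberships).

    objective: 'center' (largest kept distance), 'median' (sum of kept
    distances), or 'means' (sum of kept squared distances).
    """
    if objective not in ("center", "median", "means"):
        raise ValueError("objective must be 'center', 'median', or 'means'")
    m = len(candidates)
    totals = [0.0] * m                 # sum of contributions over ALL points
    farthest = [[] for _ in range(m)]  # min-heap of the z + 1 largest distances
    labels = [[] for _ in range(m)]    # nearest-center index of every point

    for p in stream:
        for l, centers in enumerate(candidates):
            d, j = min((math.dist(p, c), j) for j, c in enumerate(centers))
            labels[l].append(j)
            totals[l] += d * d if objective == "means" else d
            if len(farthest[l]) <= z:
                heapq.heappush(farthest[l], d)
            elif d > farthest[l][0]:
                heapq.heapreplace(farthest[l], d)

    costs = []
    for l in range(m):
        ranked = sorted(farthest[l], reverse=True)
        discarded, kept = ranked[:z], ranked[z:]
        if objective == "center":
            costs.append(kept[0] if kept else 0.0)
        elif objective == "median":
            costs.append(totals[l] - sum(discarded))
        else:
            costs.append(totals[l] - sum(d * d for d in discarded))

    best = min(range(m), key=costs.__getitem__)
    return best, costs[best], labels[best]
```

The sketch stores one label per point and candidate, so its memory use grows with the number of candidates times the dataset size; only the labels of the winning candidate are returned.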
6 An Example for Theorem 3.1 and 4.1
We construct the following instance for k-center clustering with outliers. Let be an -significant instance in , where each optimal cluster is a set of overlapping points located at its cluster center for . Let , and we assume
[TABLE]
Obviously, the optimal radius of is equal to 0. Suppose we obtain a sample satisfying for any and . Given a number , we run -center clustering on . Since the points of take distinct locations in the space, any -center clustering on will result in a radius (at least ) larger than 0; thus the approximation ratio is equal to . So, the value of cannot be further reduced in Theorem 3.1. It is easy to see that this instance can also be used to show that should be at least in Theorem 4.1.
7 Experiments
All the experimental results were obtained on a Windows workstation with a 2.8GHz Intel(R) Core(TM) i5-840 and 8GB main memory; the algorithms are implemented in Matlab R2018a. To evaluate the performance, we use several baseline algorithms including the non-uniform sampling approaches mentioned in Section 1.1 (we only consider the single-machine setting in this paper). For k-center clustering with outliers, we consider four existing algorithms: the -approximation Charikar [9], the -approximation MK [38], the -approximation Malkomes [36], and the greedy algorithm DYW [16] (as the non-uniform sampling approach). The distributed algorithm Malkomes partitions the dataset into parts and processes each part separately; to make a fair comparison, we set for Malkomes in our experiments. In our Algorithm 2, we apply MK as the subroutine in Step 2.
For k-means/median clustering with outliers, we consider the heuristic algorithm k-means-- [11] and two non-uniform sampling methods: the local search algorithm LocalSearch with k-means++ [25], and the recent data-summary-based algorithm DataSummary [12]. In our Algorithm 3 (resp., Algorithm 4), we apply k-means++ [2] (resp., k-means--) as the subroutine in Step 2.
Datasets. We generate the synthetic datasets in , and set , , and . We randomly generate k points as the cluster centers inside a hypercube of side length ; around each center, we generate a cluster of points following a Gaussian distribution with standard deviation ; we keep the total number of points to be ; to study the performance of our algorithms with respect to the ratio , we vary the size of the smallest cluster appropriately for each synthetic dataset; finally, we uniformly generate outliers at random outside the minimum enclosing balls of these clusters.
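The generation procedure above can be sketched with NumPy as follows. All parameter names (`n`, `k`, `z`, `dim`, `side`, `sigma`, `smallest`) and the function `make_synthetic` are hypothetical; the concrete values used in the paper are not reproduced here. Two simplifying assumptions are made and should be read as such: the remaining inliers are split evenly among the other clusters, and the ball centered at each generating center with radius equal to the farthest cluster point stands in for the minimum enclosing ball.

```python
import numpy as np


def make_synthetic(n, k, z, dim, side, sigma, smallest, seed=0):
    """Return (points, labels, centers); outliers carry the label -1."""
    if k < 2 or smallest < 1 or n - z - smallest < k - 1:
        raise ValueError("inconsistent sizes")
    rng = np.random.default_rng(seed)
    centers = rng.uniform(0.0, side, size=(k, dim))

    # Cluster 0 is the smallest one; the rest share the remaining inliers.
    rest = n - z - smallest
    sizes = [smallest] + [rest // (k - 1) + (i < rest % (k - 1)) for i in range(k - 1)]
    clusters = [c + sigma * rng.standard_normal((s, dim)) for c, s in zip(centers, sizes)]
    inliers = np.vstack(clusters)
    labels = np.repeat(np.arange(k), sizes)

    # Enclosing ball per cluster (stand-in for the minimum enclosing ball).
    radii = np.array([np.linalg.norm(pts - c, axis=1).max()
                      for pts, c in zip(clusters, centers)])

    # Rejection sampling: uniform points in an enlarged box, kept only if
    # they fall outside every enclosing ball.
    lo, hi = inliers.min(axis=0) - side, inliers.max(axis=0) + side
    outliers = np.empty((0, dim))
    for _ in range(1000):
        if len(outliers) >= z:
            break
        cand = rng.uniform(lo, hi, size=(max(z, 1), dim))
        dists = np.linalg.norm(cand[:, None, :] - centers[None, :, :], axis=2)
        outliers = np.vstack([outliers, cand[(dists > radii).all(axis=1)]])
    if len(outliers) < z:
        raise RuntimeError("could not place enough outliers")

    points = np.vstack([inliers, outliers[:z]])
    labels = np.concatenate([labels, np.full(z, -1)])
    return points, labels, centers
```

Varying `smallest` while keeping `n` and `z` fixed changes the size of the smallest cluster, which is the quantity the experiments vary.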
We choose real datasets from the UCI machine learning repository [17]. Covertype has clusters with points in ; Kddcup has clusters with points in ; Poker Hand has clusters with points in ; Shuttle has clusters with points in . Each dataset also has some tiny clusters with total size , and we view them as outliers. To keep the fraction of outliers to be , we add extra outliers outside the enclosing balls of the clusters as we did for the synthetic datasets.
Settings. For each dataset, we run our algorithms and the baseline algorithms trials and report the average results. For our algorithms, it is not quite convenient to set the values for the parameters in practice. Instead, it is more intuitive to directly set the sample size and for Algorithms 1 and 3 (resp., for Algorithms 2 and 4); actually, we only need these two numbers (resp., ) to implement our algorithms. In our experiments on both synthetic and real datasets, we set the sample size to be ; for completeness, we also investigate the stability with varying in another experiment below. Algorithms 1 and 3 both output cluster centers, and we define the ratio . The value could be large. Though Section 6 indicates that cannot be reduced with respect to the worst case, we do not strictly follow this (overly conservative) theoretical value in our experiments. Instead, we keep the ratio to be , and (i.e., run the algorithms [22, 2] steps). For Algorithm 2 (resp., Algorithm 4), we set that is times the expected number of outliers in ; we run the algorithm times for each instance and select the best candidate by scanning the whole dataset in one pass as discussed in Section 5 (we count the running time of the whole process). Further, we conduct two more experiments below to observe the influence of and the stability of the results returned by Algorithms 2 and 4 (if we run each of them only once).
Objective value and running time. The obtained objective values of our algorithms and the baseline algorithms are shown in Figure 3; the running times are shown in Figure 4; due to the space limit, we leave the experimental results of k-median clustering with outliers to the appendix (Section 11). For k-center clustering with outliers, our algorithms (Algorithm 2 and Algorithm 1 with ) and the four baseline algorithms achieve similar objective values for most of the instances (we run Algorithm 2 on the synthetic datasets with only; we do not run Charikar on the real datasets due to its high complexity). Moreover, the running times of our Algorithms 1 and 2 are significantly lower than those of the baseline algorithms. For k-means clustering with outliers, Algorithm 3 (except for the setting with ) and Algorithm 4 achieve results close to the best of the three baseline algorithms. DataSummary and Algorithm 4 achieve comparable running times, but Algorithm 4 outperforms DataSummary with respect to the objective values on the synthetic datasets.
The sample size . We also study the influence of the sample size on the experimental performances of our algorithms. We vary the size from to and run our algorithms on the synthetic datasets ( for Algorithm 1 and 3). We show the results in Figure 5. We can see that their performances stay stable when varying .
The influences of and for Algorithm 2 and 4. We study the influence of on the performances. We vary from to , where is the expected number of outliers in . We also study the stability of the obtained result if we run Algorithm 2 (resp., Algorithm 4) only once. We run each of them times on the synthetic datasets and show the results including the average objective values and standard deviations. From Figure 6, we can see that the performances are quite stable when varying and .
Precision and purity. To further evaluate our experimental results, we compute the measures precision and purity, which have been widely used before [37]. The precision is the proportion of the ground-truth outliers found by the algorithm (i.e., , where is the set of returned outliers and is the set of ground-truth outliers). For each obtained cluster, we assign it to the ground-truth cluster which is most frequent in the obtained cluster, and the purity measures the accuracy of this assignment. Specifically, let be the ground-truth clusters and be the obtained clusters from the algorithm; the purity is equal to . The experimental results suggest that our algorithms can achieve the precisions and the purities comparable to those of the baselines. Due to the space limit, we leave the details to Section 12.
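The two measures can be computed as in the following Python sketch. It follows the verbal description above under two assumptions that should be checked against the paper's formulas: precision is normalized by the number of ground-truth outliers, and purity is normalized by the number of points that appear in the obtained clusters. The function names are hypothetical.

```python
from collections import Counter


def outlier_precision(returned_outliers, true_outliers):
    """Fraction of ground-truth outliers that the algorithm also reports."""
    true_set = set(true_outliers)
    if not true_set:
        raise ValueError("the ground-truth outlier set is empty")
    return len(set(returned_outliers) & true_set) / len(true_set)


def purity(true_labels, predicted_labels):
    """Assign every obtained cluster to its most frequent ground-truth
    cluster and return the fraction of points covered by that assignment.
    Both sequences must describe the same points in the same order."""
    if len(true_labels) != len(predicted_labels) or not true_labels:
        raise ValueError("label sequences must be non-empty and of equal length")
    per_cluster = {}
    for t, p in zip(true_labels, predicted_labels):
        per_cluster.setdefault(p, Counter())[t] += 1
    majority = sum(max(counts.values()) for counts in per_cluster.values())
    return majority / len(true_labels)
```

For example, with ground-truth labels `a, a, b, b, b` and obtained clusters `0, 0, 0, 1, 1`, cluster 0 is assigned to `a` (two of its three points) and cluster 1 to `b`, so four of the five points are counted.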
8 Future Work
Following this work, an interesting question is whether the significance criterion can be applied to analyze the performance of uniform sampling for other well-known optimization problems, such as PCA with outliers [5] and projective clustering with outliers [20].
9 Proof of Lemma 1
Lemma 1 can be directly obtained through the following claim (we need to replace by in Claim 1, for taking the union bound over all the clusters).
Claim 1
Let be a set of elements and with . Given , one uniformly selects a set of elements from at random. (i) If , with probability at least , contains at least one element from . (ii) If , with probability at least , we have \big{|}|S\cap V|-\tau|S|\big{|}\leq\delta\tau|S|.
Proof
Actually, (i) is a folklore result that has been presented in several papers before (such as [15]). Since each sampled element falls in with probability , we know that the sample contains at least one element from with probability . Therefore, if we want , should be at least .
(ii) can be proved by using the Chernoff bound [1]. Define random variables : for each , if the -th sampled element falls in , otherwise, . So for each . As a consequence, we have
[TABLE]
If , with probability at least , \big{|}\sum^{|S|}_{i=1}y_{i}-\tau|S|\big{|}\leq\delta\tau|S| (i.e., \big{|}|S\cap V|-\tau|S|\big{|}\leq\delta\tau|S|). ∎
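Both parts of Claim 1 can be checked numerically with the short Python sketch below. It assumes that the sampled elements are independent draws, each falling in V with probability \tau (as in the proof above); the failure-probability symbol `eta` and all function names are illustrative assumptions. Part (i) uses the elementary bound (1 - \tau)^{s} \leq e^{-\tau s}, so a sample of size s \geq \frac{1}{\tau}\ln\frac{1}{\eta} misses V with probability at most \eta.

```python
import math
import random


def sample_size_for_hit(tau: float, eta: float) -> int:
    """Smallest integer s with exp(-tau * s) <= eta, i.e. s >= ln(1/eta) / tau."""
    if not (0.0 < tau <= 1.0 and 0.0 < eta < 1.0):
        raise ValueError("need 0 < tau <= 1 and 0 < eta < 1")
    return math.ceil(math.log(1.0 / eta) / tau)


def miss_probability(tau: float, s: int) -> float:
    """Exact probability that s independent draws all miss V."""
    return (1.0 - tau) ** s


def empirical_concentration(tau, s, delta, trials, seed=0):
    """Fraction of trials in which | |S ∩ V| - tau*s | <= delta*tau*s holds."""
    rng = random.Random(seed)
    ok = 0
    for _ in range(trials):
        hits = sum(rng.random() < tau for _ in range(s))
        ok += abs(hits - tau * s) <= delta * tau * s
    return ok / trials
```

The simulation in `empirical_concentration` only illustrates part (ii); it does not replace the Chernoff-bound argument.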
10 Extensions of Theorem 4.1 and 4.2
The results of Theorems 4.1 and 4.2 can be easily extended to k-median clustering with outliers in Euclidean space by using almost the same idea, where the only difference is that we can directly use the triangle inequality in the proof (e.g., the inequality (11) is replaced by ); the coefficients and are reduced to \big(1+(1+c)\frac{1+\delta}{1-\delta}\big) and (resp., \big(1+(1+c)\frac{t}{t-1}\frac{1+\delta}{1-\delta}\big) and ) in Theorem 4.1 (resp., Theorem 4.2), respectively.
To solve the metric k-median/means clustering with outliers problems for an instance , we should keep in mind that the cluster centers can only be selected from the vertices of . However, the optimal cluster centers may not be contained in the sample , and thus we need to modify our analysis slightly. We observe that the sample contains a set of vertices close to with certain probability. Specifically, for each , there exists a vertex such that (or \big(d(o^{\prime}_{j},o^{*}_{j})\big)^{2}\leq O(1)\times\frac{1}{|C^{*}_{j}|}\sum_{p\in C^{*}_{j}}\big(d(p,o^{*}_{j})\big)^{2}) with constant probability (this claim can be easily proved by using Markov's inequality). Consequently, we can use to replace in our analysis, and achieve results similar to Theorems 4.1 and 4.2.
11 The Experimental Results for k-median Clustering with Outliers
12 Precision and Purity
References
- [1] N. Alon and J. H. Spencer. The probabilistic method. John Wiley & Sons, 2004.
- [2] D. Arthur and S. Vassilvitskii. k-means++: The advantages of careful seeding. In Proceedings of the eighteenth annual ACM-SIAM symposium on Discrete algorithms, pages 1027–1035. Society for Industrial and Applied Mathematics, 2007.
- [3] P. Awasthi and M.-F. Balcan. Center based clustering: A foundational perspective. 2014.
- [4] M. Bădoiu, S. Har-Peled, and P. Indyk. Approximate clustering via core-sets. In Proceedings of the ACM Symposium on Theory of Computing (STOC), pages 250–257, 2002.
- [5] E. J. Candès, X. Li, Y. Ma, and J. Wright. Robust principal component analysis? J. ACM, 58(3):11:1–11:37, 2011.
- [6] M. Ceccarello, A. Pietracaprina, and G. Pucci. Solving k-center clustering (with outliers) in mapreduce and streaming, almost as accurately as sequentially. PVLDB, 12(7):766–778, 2019.
- [7] D. Chakrabarty, P. Goyal, and R. Krishnaswamy. The non-uniform k-center problem. In 43rd International Colloquium on Automata, Languages, and Programming, ICALP 2016, July 11-15, 2016, Rome, Italy, pages 67:1–67:15, 2016.
- [8] V. Chandola, A. Banerjee, and V. Kumar. Anomaly detection: A survey. ACM Computing Surveys (CSUR), 41(3):15, 2009.
