Density-based Clustering with Best-scored Random Forest
Hanyuan Hang, Yuchao Cai, and Hanfang Yang

TL;DR
This paper introduces a novel density-based clustering algorithm called 'best-scored clustering forest' that selects the best-performing random tree to determine optimal clusters, with proven consistency and competitive accuracy.
Contribution
It proposes a new ensemble clustering method that guarantees consistency and convergence, improving upon existing density-based clustering techniques.
Findings
Algorithm achieves high accuracy on synthetic data
Demonstrates competitive performance on real benchmark datasets
Proven theoretical guarantees of consistency and convergence
| Datasets | ADP-Cluster | DBSCAN | k-means | PDF-Cluster | Ours |
| aniso-blob | 0.519693602 | 1 | 0.590986466 | 0.992013365 | 1 |
| noisy circles | 0.165814882 | 1 | 0.156841194 | 0.189604897 | 1 |
| noisy moons | 0.491529298 | 1 | 0.50093108 | 1 | 1 |
| varied blob | 0.844780818 | 0.906919484 | 0.824744996 | 0.939402597 | 0.936238496 |
| Datasets | ADP-Cluster | DBSCAN | k-means | PDF-Cluster | Ours |
| appendicitis | 0.525568993 | 0.419405321 | 0.318301412 | 0.468845316 | 0.518105812 |
| customers | 0.260017768 | 0.196905491 | 0.400496777 | 0.012502467 | 0.625997132 |
| flea | 0.879214629 | 1 | 0.957543737 | 1 | 0.973650744 |
| iris | 0.568115942 | 0.61410887 | 0.716342113 | 0.568115942 | 0.778123403 |
| oliveoil | 0.572903083 | 1 | 0.627921018 | 0.865827662 | 1 |
| wifi localization | 0.914080542 | 0.869141447 | 0.314316461 | 0.232882926 | 0.909102543 |
| wine | 0.817666167 | 0.847096681 | 0.622913 | 0.845786696 | 0.872752411 |
Hanyuan Hang†
Yuchao Cai‡
and Hanfang Yang‡ (corresponding author, e-mail: [email protected])
†Institute of Statistics and Big Data
Renmin University of China
‡School of Statistics
Renmin University of China
Abstract
Single-level density-based approach has long been widely acknowledged to be a conceptually and mathematically convincing clustering method. In this paper, we propose an algorithm called best-scored clustering forest that can obtain the optimal level and determine corresponding clusters. The terminology best-scored means to select one random tree with the best empirical performance out of a certain number of purely random tree candidates. From the theoretical perspective, we first show that consistency of our proposed algorithm can be guaranteed. Moreover, under certain mild restrictions on the underlying density functions and target clusters, even fast convergence rates can be achieved. Last but not least, comparisons with other state-of-the-art clustering methods in the numerical experiments demonstrate accuracy of our algorithm on both synthetic data and several benchmark real data sets.
Keywords: cluster analysis, nonparametric density estimation, purely random decision tree, random forest, ensemble learning, statistical learning theory
1 Introduction
Regarded as one of the most basic tools for investigating the statistical properties of unsupervised data, clustering aims to group a set of objects in such a way that objects in the same cluster are more similar in some sense to each other than to those in other clusters. Typical applications range from the categorization of tissues in medical imaging to the grouping of internet search results. For instance, on PET scans, cluster analysis can distinguish between different types of tissue in a three-dimensional image for many different purposes (Filipovych et al., 2011), while in the intelligent grouping of files and websites, clustering algorithms create a more relevant set of search results (Marco and Navigli, 2013). Because of these wide applications, there is a growing demand for clustering algorithms that not only maintain desirable accuracy but also have high computational efficiency. In the literature, a wealth of algorithms have already been proposed, such as k-means (Macqueen, 1967), linkage (Ward, 1963; Sibson, 1973; Defays, 1977), cluster tree (Stuetzle, 2003), DBSCAN (Ester et al., 1996), spectral clustering (Donath and Hoffman, 1973; Luxburg, 2007), and expectation-maximization for generative models (Dempster et al., 1977).
As is widely acknowledged, an open problem in cluster analysis is how to give a conceptually and mathematically convincing definition of clusters. In the literature, great efforts have been made to deal with this problem. Perhaps the first definition dates back to Hartigan (1975) and is known as single-level density-based clustering: it assumes i.i.d. data $D = (x_1, \ldots, x_n)$ generated by some unknown distribution $P$ that has a continuous density $f$, and the clusters of $P$ are then defined to be the connected components of the level set $\{f \geq \rho\}$ given some $\rho \geq 0$. Since then, different methods based on a density estimator $\hat{f}$ and the connected components of $\{\hat{f} \geq \rho\}$ have been established (Cuevas and Fraiman, 1997; Maier et al., 2012; Rigollet, 2006; Rinaldo and Wasserman, 2010).
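The single-level definition can be illustrated with a minimal sketch in one dimension. The density values below are hypothetical numbers on an equally spaced grid, and `count_components` is a helper introduced only for this illustration: it counts the maximal runs of neighbouring grid cells on which the density is at least the chosen level.

```python
def count_components(density, rho):
    """Count maximal runs of consecutive grid cells with density >= rho."""
    count, inside = 0, False
    for value in density:
        if value >= rho and not inside:
            count += 1
        inside = value >= rho
    return count


# Hypothetical density values on an equally spaced 1-D grid: two bumps of
# different height separated by a valley (illustrative numbers only).
density = [0.0, 0.2, 0.6, 0.9, 0.6, 0.3, 0.5, 0.7, 0.5, 0.1, 0.0]

# Number of connected components of the level set for several levels.
components_by_level = {rho: count_components(density, rho)
                       for rho in (0.1, 0.4, 0.8, 1.0)}
```

In this toy example the level set has one component at a low level, two at an intermediate level, one again near the higher peak, and none above the maximum, so the clustering depends on the level that is chosen.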
Note that the single-level approach mentioned above has a conceptual drawback: different values of the level $\rho$ may lead to different (numbers of) clusters, and there is no general rule for choosing $\rho$. In order to address this conceptual shortcoming, another type of clustering algorithm, namely hierarchical clustering, was proposed, in which the hierarchical tree structure of the connected components for different levels is estimated. Within this framework, instead of choosing some $\rho$, the so-called cluster tree approach tries to consider all levels and the corresponding connected components simultaneously. It is worth pointing out that the advantage of the cluster tree approach lies in the fact that it mainly focuses on the identification of the hierarchical tree structure of the connected components for different levels. For this reason, there have already been many attempts in the literature to establish its theoretical foundations. For example, Hartigan (1981) proved the consistency of a hierarchical clustering method named single linkage only for the one-dimensional case; the high-dimensional case is more delicate, since single linkage is only fractionally consistent there. To address this problem, Chaudhuri and Dasgupta (2010) proposed a modified single linkage algorithm which is shown to have finite-sample convergence rates as well as lower bounds on the sample complexity under certain assumptions on the density $f$. Furthermore, Kpotufe (2011) obtained similar theoretical results with an underlying k-NN density estimator and achieved experimental improvement by means of a simple pruning strategy that removes connected components that artificially occur because of finite sample variability. However, the notion of recovery taken from Hartigan (1981) falls short in that it focuses only on the correct estimation of the cluster tree structure and not on the estimation of the clusters themselves; for more details we refer to Rinaldo and Wasserman (2010).
So far, theoretical guarantees such as consistency and learning rates for the existing hierarchical clustering algorithms are only valid for the cluster tree structure and are therefore far from satisfactory. As a result, in this paper, we proceed with the study of single-level density-based clustering. Recently, various results on estimating the optimal level have been established in the literature. First of all, Steinwart (2011) and Steinwart (2015a) presented algorithms based on histogram density estimators that are able to asymptotically determine the optimal level and automatically yield a consistent estimator for the target clusters. These algorithms are of limited practical value, however, since only the simplest possible density estimators are considered. Attempting to address this issue, Sriperumbudur and Steinwart (2012) proposed a modification of the popular DBSCAN clustering algorithm. Although consistency and optimal learning rates have been established for this new DBSCAN-type construction, the main difficulty in carrying out this algorithm is that it restricts consideration to moving window density estimators for $\alpha$-Hölder continuous densities. In addition, it is worth noting that none of the algorithms mentioned above adapts well to the case where the underlying distribution possesses no split in the cluster tree. To tackle this problem, Steinwart et al. (2017) proposed an adaptive algorithm using kernel density estimators which, however, also only performs well for low-dimensional data.
In this paper, we mainly focus on clusters that are defined as the connected components of high density regions and present an algorithm called best-scored clustering forest, which not only guarantees consistency and attains fast convergence rates, but also enjoys satisfactory performance in various numerical experiments. The main contributions of this paper are twofold: (i) Concerning the theoretical analysis, we prove that with the help of the best-scored random forest density estimator, our proposed algorithm can ensure consistency and achieve fast convergence rates under certain assumptions on the underlying density functions and target clusters. We mention that the convergence analysis is conducted within the framework established in Steinwart (2015a). To be more precise, under properly chosen hyperparameters of the best-scored random forest density estimator (Hang and Wen, 2018), the consistency of the best-scored clustering forest can be ensured. Moreover, under some additional regularity conditions, even fast convergence rates can be achieved. (ii) When it comes to numerical experiments, we improve the original purely random splitting criterion by proposing an adaptive splitting method. Instead of choosing the to-be-split node uniformly at random, at each step we randomly select a sample point from the training data set, and the to-be-split node is the one in which this point falls. The idea behind this procedure is that when sample points are picked at random from the whole training data set, nodes with more samples are more likely to be chosen, whereas nodes containing fewer samples are less likely to be selected. In this way, the probability of obtaining cells with evenly distributed sample sizes is much greater. Empirical experiments further show that the adaptive/recursive method enhances the efficiency of our algorithm since it actually increases the effective number of splits.
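The difference between the two ways of choosing the node to split can be sketched as follows. This is a one-dimensional illustration under simplifying assumptions: cells are half-open intervals, and the function names `pick_cell_uniform` and `pick_cell_adaptive` are hypothetical, not taken from the paper.

```python
import random


def pick_cell_uniform(cells, rng):
    """Purely random rule: every cell is equally likely to be split."""
    return rng.randrange(len(cells))


def pick_cell_adaptive(cells, samples, rng):
    """Adaptive rule: draw a sample point and split the cell containing it."""
    x = rng.choice(samples)
    for index, (low, high) in enumerate(cells):
        if low <= x < high:
            return index
    raise ValueError("sample lies outside every cell")


rng = random.Random(0)
cells = [(0.0, 0.5), (0.5, 1.0)]           # two 1-D cells
samples = [0.1, 0.2, 0.3, 0.4, 0.45, 0.9]  # five points on the left, one on the right

# Under the adaptive rule a cell is chosen with probability proportional to
# the number of samples it holds, so the left cell is picked about 5/6 of the time.
draws = [pick_cell_adaptive(cells, samples, rng) for _ in range(6000)]
share_left = draws.count(0) / len(draws)
```

With the uniform rule both cells would be chosen about equally often, regardless of how many samples they contain.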
The rest of this paper is organized as follows: Section 2 introduces some fundamental notations and definitions related to density level sets and the best-scored random forest density estimator. Section 3 is dedicated to the exposition of the generic clustering algorithm architecture. We provide our main theoretical results on the consistency and learning rates of the proposed best-scored clustering forest in Section 4, where the main analysis aims to verify that our best-scored random forest provides a level set estimator that has control over both its vertical and horizontal uncertainty. Some comments and discussions on the established theoretical results are also presented in this section. Numerical experiments comparing the best-scored clustering forest with other density-based clustering methods are given in Section 5. All the proofs for Section 3 and Section 4 can be found in Section 6. We conclude this paper with a brief discussion in the last section.
2 Preliminaries
In this section, we recall several basic concepts and notations related to clusters in the first subsection while in the second subsection we briefly recall the best-scored random forest density estimation proposed recently by Hang and Wen (2018).
2.1 Density Level Sets and Clusters
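This subsection begins by introducing some basic notations and assumptions about density level sets and clusters. Throughout this paper, let $X \subset \mathbb{R}^d$ be a compact and connected subset and $\mu$ be the Lebesgue measure with $\mu(X) > 0$. Moreover, let $P$ be a probability measure that is absolutely continuous with respect to $\mu$ and possesses a bounded density $f$ with support $X$. We denote the centered hypercube of $\mathbb{R}^d$ with side length $2r$ by $B_r$, where
$$B_r := [-r, r]^d = \{x = (x_1, \ldots, x_d) \in \mathbb{R}^d : x_i \in [-r, r], \ i = 1, \ldots, d\},$$
and the complement of $B_r$ is written as $B_r^c := \mathbb{R}^d \setminus B_r$.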
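Given a set $A \subset X$, we denote by $\mathring{A}$ its interior, $\overline{A}$ its closure, $\partial A$ its boundary, and $\operatorname{diam}(A)$ its diameter. Furthermore, for a given $x$, $d(x, A)$ denotes the distance between $x$ and $A$. Given another set $B$, we denote by $A \triangle B$ the symmetric difference between $A$ and $B$. Moreover, $\mathbf{1}_A$ stands for the indicator function of the set $A$.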
We say that a function $f : X \to \mathbb{R}$ is $\alpha$-Hölder continuous, if there exists a constant $c > 0$ such that
$$|f(x) - f(x')| \leq c \, \|x - x'\|^{\alpha} \qquad \text{for all } x, x' \in X.$$
Note that $f$ is constant whenever $\alpha > 1$.
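A short derivation indicates why an exponent above one forces a constant function. It assumes that Hölder continuity is understood as $|f(x) - f(x')| \le c \, \|x - x'\|^{\alpha}$ and that the line segment between $x$ and $x'$ lies in the domain; the symbols $f$, $c$, $\alpha$ and the intermediate points $x_i$ are notation chosen for this sketch.

```latex
% Split the segment from x to x' into n equal pieces:
% x_i := x + (i/n)(x' - x), i = 0, ..., n, so that x_0 = x and x_n = x'.
\begin{align*}
|f(x) - f(x')|
  &\le \sum_{i=1}^{n} |f(x_i) - f(x_{i-1})|
   \le \sum_{i=1}^{n} c \, \|x_i - x_{i-1}\|^{\alpha} \\
  &= n \, c \left( \frac{\|x - x'\|}{n} \right)^{\alpha}
   = c \, \|x - x'\|^{\alpha} \, n^{1 - \alpha}
   \xrightarrow[n \to \infty]{} 0
   \qquad \text{if } \alpha > 1 .
\end{align*}
```

Since the left-hand side does not depend on $n$, it must vanish, that is, $f(x) = f(x')$.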
Finally, throughout this paper, we use the notation $a_n \lesssim b_n$ to denote that there exists a positive constant $c$ such that $a_n \leq c \, b_n$ for all $n \in \mathbb{N}$.
2.1.1 Density Level Sets
In order to find a notion of density level set which is topologically invariant against different choices of the density $f$ of the distribution $P$, Steinwart (2011) proposes to define a density level set at level $\rho \geq 0$ by
$$M_\rho := \operatorname{supp} \mu_\rho,$$
where $\operatorname{supp} \mu_\rho$ stands for the support of $\mu_\rho$, and the measure $\mu_\rho$ is defined by
$$\mu_\rho(A) := \mu(A \cap \{f \geq \rho\}), \qquad A \in \mathcal{B}(X),$$
where $\mathcal{B}(X)$ denotes the Borel $\sigma$-algebra of $X$. According to the definition, the density level set $M_\rho$ is closed. If the density $f$ is assumed to be $\alpha$-Hölder continuous, the above construction could be replaced by the usual $\{f \geq \rho\}$ without changing our results.
Here, some important properties of the sets , are useful:
- (i) Level Sets.
- (ii) Monotonicity. for all .
- (iii) Regularity. .
- (iv) Normality. , where and .
- (v) Open Level Sets. .
2.1.2 Comparison of Partitions and Notations of Connectivity
Before introducing the definition of clusters, some notions related to the connected components of level sets are in need. First of all, we give the definition that compares different partitions.
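**Definition 2.1.**
Let $A, B$ be nonempty sets with $A \subset B$, and $\mathcal{P}(A)$ and $\mathcal{P}(B)$ be partitions of $A$ and $B$, respectively. Then $\mathcal{P}(A)$ is said to be comparable to $\mathcal{P}(B)$, if for all $A' \in \mathcal{P}(A)$, there exists a $B' \in \mathcal{P}(B)$ such that $A' \subset B'$. In this case, we write $\mathcal{P}(A) \sqsubset \mathcal{P}(B)$.
It can be easily deduced that $\mathcal{P}(A)$ is comparable to $\mathcal{P}(B)$ if no cell $A' \in \mathcal{P}(A)$ is broken into pieces in $\mathcal{P}(B)$. Let $\mathcal{P}_1$ and $\mathcal{P}_2$ be two partitions of $A$; then we call $\mathcal{P}_1$ finer than $\mathcal{P}_2$ if and only if $\mathcal{P}_1 \sqsubset \mathcal{P}_2$. Moreover, as is demonstrated in Steinwart (2015b), for two partitions $\mathcal{P}(A)$ and $\mathcal{P}(B)$ with $\mathcal{P}(A) \sqsubset \mathcal{P}(B)$, there exists a unique map $\zeta : \mathcal{P}(A) \to \mathcal{P}(B)$ such that $A' \subset \zeta(A')$ for $A' \in \mathcal{P}(A)$. We call $\zeta$ the cell relating map (CRM) between $A$ and $B$.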
Now, we give further insight into two vital examples of comparable partitions coming from connected components. Recall that a set $A \subset X$ is topologically connected if, for every pair $A', A'' \subset A$ of relatively closed disjoint subsets of $A$ with $A' \cup A'' = A$, we have $A' = \emptyset$ or $A'' = \emptyset$. The maximal connected subsets of $A$ are called the connected components of $A$. As is widely acknowledged, these components make up a partition of $A$, and we denote it by $\mathcal{C}(A)$. Furthermore, for a closed $A \subset B$ with $|\mathcal{C}(B)| < \infty$, we have $\mathcal{C}(A) \sqsubset \mathcal{C}(B)$.
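The next example describes another type of connectivity, namely $\tau$-connectivity, which can be considered as a discrete version of path-connectivity. For the latter, let us fix a $\tau > 0$ and $A \subset X$. Then, $x, x' \in A$ are called $\tau$-connected in $A$, if there exist $x_1, \ldots, x_n \in A$ such that $x_1 = x$, $x_n = x'$, and $\|x_i - x_{i+1}\| < \tau$ for all $i = 1, \ldots, n-1$. Clearly, being $\tau$-connected gives an equivalence relation on $A$. To be specific, the resulting partition can be written as $\mathcal{C}_\tau(A)$, and we call its cells the $\tau$-connected components of $A$. It can be verified that, for all $A \subset X$ and $\tau > 0$, we always have $\mathcal{C}(A) \sqsubset \mathcal{C}_\tau(A)$, see Lemma A.2.7 in Steinwart (2015b). In addition, if $|\mathcal{C}(A)| < \infty$, then we have $\mathcal{C}(A) = \mathcal{C}_\tau(A)$ for all sufficiently small $\tau > 0$, see Section 2.2 in Steinwart (2015a).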
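For a finite point set, the components induced by this discrete notion of connectivity can be computed with a union-find structure. The sketch below assumes the Euclidean norm and a strict inequality (two points are linked when their distance is below the threshold `tau`); the function name `tau_connected_components` is chosen for this illustration.

```python
import math


def tau_connected_components(points, tau):
    """Partition a finite point set into tau-connected components (union-find)."""
    parent = list(range(len(points)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    # Link every pair of points that is closer than tau; chains of such
    # links then connect points that are far apart but joined by a path.
    for i in range(len(points)):
        for j in range(i + 1, len(points)):
            if math.dist(points[i], points[j]) < tau:
                parent[find(i)] = find(j)

    groups = {}
    for i, point in enumerate(points):
        groups.setdefault(find(i), []).append(point)
    return sorted(groups.values())


points = [(0.0, 0.0), (0.3, 0.0), (0.6, 0.0), (2.0, 0.0), (2.2, 0.0)]
fine = tau_connected_components(points, tau=0.5)    # two components
coarse = tau_connected_components(points, tau=1.5)  # everything merges
```

A smaller threshold yields a finer partition, and a larger threshold can only merge components, never split them.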
2.1.3 Clusters
Based on the concept established in the preceding subsections we now recall the definition of clusters, see also Definition 2.5 in Steinwart (2015b).
**Definition 2.2 (Clusters).**
Let be a compact and connected set, and be a -absolutely continuous distribution. Then can be clustered between and , if is normal and for all , the following three conditions are satisfied:
- (i) We have either or ;
- (ii) If we have , then ;
- (iii) If we have , then and .
Using the CRMs ; we then define the clusters of by
[TABLE]
where and are the two topologically connected components of . Finally, we define
[TABLE]
To illustrate, the above definition ensures that the level sets below $\rho^*$ are connected, while there are exactly two components in the level sets for a certain range above $\rho^*$. Moreover, any two level sets within this range are supposed to be comparable. As a result, the topological structure between $\rho^*$ and $\rho^{**}$ can be determined by that of $M_{\rho^{**}}$. In this manner, the connected components of $M_\rho$, $\rho \in (\rho^*, \rho^{**}]$, can be numbered by the connected components of $M_{\rho^{**}}$. This numbering procedure is reflected in the definition of the clusters as well as in that of the function $\tau^*$, which in essence measures the distance between the two connected components at level $\rho^* + \varepsilon$.
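Since quantifying the uncertainty of clusters is indispensable, we need to introduce, for $A \subset X$ and $\delta > 0$, the sets
$$A^{+\delta} := \{x \in X : d(x, A) \leq \delta\}, \qquad A^{-\delta} := X \setminus (X \setminus A)^{+\delta}. \tag{2.2}$$
In other words, $A^{+\delta}$ can be recognized as adding a $\delta$-tube to $A$, while $A^{-\delta}$ is treated as removing a $\delta$-tube from $A$. We want to avoid cases where the density level sets have bridges or cusps that are too thin. To be more precise, recall that for a closed $A \subset X$, the function $\psi_A^* : (0, \infty) \to [0, \infty]$ is defined by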
[TABLE]
Particularly, for all , we have for all , and if , then . Consequently, according to Lemma A.4.3 in Steinwart (2015b), for all with and all , we have
[TABLE]
whenever is contained in some compact and .
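The two tube operations introduced earlier in this subsection can be illustrated on a finite grid. The sketch assumes that adding a tube collects all points within a given distance of the set and that removing a tube is the complement of the tube around the complement; the integer grid standing in for the ambient space and the function names are hypothetical.

```python
def add_tube(subset, universe, delta):
    """Points of the universe within distance delta of the subset."""
    return {x for x in universe if any(abs(x - a) <= delta for a in subset)}


def remove_tube(subset, universe, delta):
    """Complement of the delta-tube around the complement of the subset."""
    return universe - add_tube(universe - subset, universe, delta)


universe = set(range(20))      # hypothetical integer grid standing in for the space
subset = {5, 6, 7, 8, 9, 12}   # a block of five points plus one isolated point

grown = add_tube(subset, universe, 1)      # the block and the point merge
shrunk = remove_tube(subset, universe, 1)  # the isolated point disappears
```

Here `grown` is the contiguous range 4 to 13 and `shrunk` keeps only the interior 6, 7, 8 of the block: thin parts of a set, such as the isolated point, do not survive the removal of a tube, which is the kind of behaviour that thickness conditions are meant to control.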
With the preceding preparations, we now come to the following definition excluding bridges and cusps which are too thin.
**Definition 2.3.**
Let be a compact and connected set, and be a -absolutely continuous distribution that is normal. Then we say that has thick level sets of order up to the level , if there exist constants and such that, for all and , we have
[TABLE]
In this case, we call the thickness function of .
In order to describe the distribution we wish to cluster, we now make the following assumption based on all concepts introduced so far.
*Assumption 2.1.*
The distribution with bounded density can be clustered between and . Moreover, has thick level sets of order up to the level . The corresponding thickness function is denoted by and the function defined in (2.1) is abbreviated as .
In the case that all level sets are connected, we introduce the following assumption to investigate the behavior of the algorithm in situations in which cannot be clustered.
*Assumption 2.2.*
Let be a compact and connected set, and be a -absolutely continuous distribution that is normal. Assume that there exist constants , , and such that for all and , the following conditions hold:
- (i)
.
- (ii)
If then .
- (iii)
If , then for all non-empty and all .
- (iv)
For each there exists a with .
2.2 Best-scored Random Forest Density Estimation
Considering the fact that the density estimation should come first before the analysis on the level sets, we dedicate this subsection to the methodology of building an appropriate density estimator. Different from the usual histogram density estimation (Steinwart, 2015a) and kernel density estimation (Steinwart et al., 2017), this paper adopts a novel random forest-based density estimation strategy, namely the best-scored random forest density estimation proposed recently by Hang and Wen (2018).
2.2.1 Purely Random Density Tree
Recall that each tree in the best-scored random forest is built on a purely random partition following the idea of Breiman (2000). To give a clear description of one possible construction procedure of this purely random partition, we introduce the random vector $Q_i := (L_i, R_i, S_i)$ as in Hang and Wen (2018), which represents the building mechanism at the $i$-th step. To be specific,
- $L_i$ denotes the to-be-split cell at the $i$-th step, chosen uniformly at random from all cells formed in the $(i-1)$-th step;
- $R_i \in \{1, \ldots, d\}$ stands for the dimension chosen to be split from in the $i$-th step, where each dimension has the same probability of being selected, that is, $\{R_i, i \in \mathbb{N}_+\}$ are i.i.d. multinomial distributed with equal probabilities;
- $S_i$ is a proportional factor standing for the ratio between the length of the newly generated cell in the $R_i$-th dimension after the $i$-th split and the length of the being-cut cell in the $R_i$-th dimension. We emphasize that $\{S_i, i \in \mathbb{N}_+\}$ are i.i.d. drawn from the uniform distribution $\mathrm{Unif}(0, 1)$.
In this manner, the above splitting procedure leads to a so-called partition variable with probability measure of denoted by , and any specific partition variable can be treated as a splitting criterion. Moreover, for the sake of notation clarity, we denote by the collection of non-overlapping cells formed after conducting splits on following . This can be further abbreviated as which exactly represents a random partition on . Accordingly, we have , and for certain sample , the cell where it falls is denoted by .
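The splitting mechanism described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: it assumes the partitioned region is a centered hypercube with half side length `r`, represents each cell as a list of per-dimension intervals, and uses the hypothetical function name `purely_random_partition`.

```python
import random


def purely_random_partition(r, d, p, rng):
    """Split the hypercube [-r, r]^d p times and return the p + 1 cells.

    Each cell is a list of (low, high) intervals, one per dimension.
    """
    cells = [[(-r, r)] * d]
    for _ in range(p):
        index = rng.randrange(len(cells))  # to-be-split cell, chosen uniformly
        dim = rng.randrange(d)             # split dimension, chosen uniformly
        ratio = rng.random()               # proportional factor, uniform on [0, 1)
        low, high = cells[index][dim]
        cut = low + ratio * (high - low)
        left, right = list(cells[index]), list(cells[index])
        left[dim], right[dim] = (low, cut), (cut, high)
        cells[index:index + 1] = [left, right]  # replace the cell by its two halves
    return cells


def volume(cell):
    """Lebesgue measure of an axis-parallel cell."""
    result = 1.0
    for low, high in cell:
        result *= high - low
    return result


rng = random.Random(1)
cells = purely_random_partition(r=1.0, d=2, p=5, rng=rng)
total_volume = sum(volume(cell) for cell in cells)  # cells tile the hypercube
```

None of the three random choices looks at the data, which is what makes the partition purely random.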
In order to better characterize the purely random density tree, we give another expression of the random partition on , which is where represents one of the resulting cells of this partition. Based on this partition, we can build the random density tree with respect to probability measure on , denoted as , defined by
[TABLE]
where unless otherwise stated, we assume that for all , the Lebesgue measure . In this regard, when taking , the density tree decision rule becomes
[TABLE]
where . When taking to be the empirical measure , we obtain
[TABLE]
and hence the density tree turns into
[TABLE]
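A minimal one-dimensional sketch of a density tree with respect to the empirical measure follows. It assumes the usual piecewise-constant form, in which the estimate on a cell is the fraction of samples falling into the cell divided by the cell's length; the partition and the sample values are hypothetical.

```python
def density_tree(cells, samples):
    """Return the map x -> (fraction of samples in the cell of x) / (cell length)."""
    n = len(samples)
    counts = [sum(low <= s < high for s in samples) for low, high in cells]

    def estimate(x):
        for (low, high), count in zip(cells, counts):
            if low <= x < high:
                return count / (n * (high - low))
        return 0.0  # x lies outside the partitioned interval

    return estimate


cells = [(-1.0, -0.5), (-0.5, 0.5), (0.5, 1.0)]      # hypothetical 1-D partition
samples = [-0.8, -0.1, 0.0, 0.2, 0.3, 0.7, 0.9, 0.4]  # hypothetical data
f_hat = density_tree(cells, samples)

# The estimate is constant on each cell, so its integral is a finite sum.
integral = sum(f_hat((low + high) / 2) * (high - low) for low, high in cells)
```

Because every sample lies in exactly one cell, the estimate integrates to one and is therefore itself a density.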
2.2.2 Best-scored Random Density Trees and Forest
Since the above partitions make no use of the sample information, the predictions of their ensemble forest may not be accurate enough. In order to improve the accuracy, we select one partition for tree construction out of a certain number of candidates with the best density estimation performance according to a certain performance measure such as ANLL (Hang and Wen, 2018, Section 5.4). The resulting trees are then called the best-scored random density trees.
Now, let , be the best-scored random density tree estimators generated by the splitting criteria respectively, which are defined by
[TABLE]
where is a random partition of . Then the best-scored random density forest can be formulated by
[TABLE]
and its population version is denoted by .
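The selection and averaging steps can be sketched in one dimension as follows. Several choices here are assumptions made for the illustration: ANLL is read as the average negative log-likelihood, it is evaluated on a held-out validation split, the forest is the plain average of the selected trees, and the names `m` (trees), `k` (candidates per tree) and `p` (splits per tree) are hypothetical parameter names.

```python
import math
import random


def random_cells(p, rng, low=-1.0, high=1.0):
    """Purely random 1-D partition of [low, high) obtained by p splits."""
    cells = [(low, high)]
    for _ in range(p):
        i = rng.randrange(len(cells))
        a, b = cells[i]
        cut = a + rng.random() * (b - a)
        cells[i:i + 1] = [(a, cut), (cut, b)]
    return cells


def tree_density(cells, train):
    """Piecewise-constant density: cell frequency divided by cell length."""
    def estimate(x):
        for a, b in cells:
            if a <= x < b:
                inside = sum(a <= s < b for s in train)
                return inside / (len(train) * (b - a))
        return 0.0
    return estimate


def anll(estimate, validation):
    """Average negative log-likelihood; infinite if a point gets density 0."""
    values = [estimate(x) for x in validation]
    if min(values) <= 0.0:
        return math.inf
    return -sum(math.log(v) for v in values) / len(values)


def best_scored_forest(train, validation, m, k, p, rng):
    """Average m trees, each the lowest-ANLL tree among k random candidates."""
    trees = []
    for _ in range(m):
        candidates = [tree_density(random_cells(p, rng), train) for _ in range(k)]
        trees.append(min(candidates, key=lambda tree: anll(tree, validation)))
    return lambda x: sum(tree(x) for tree in trees) / m


rng = random.Random(0)
# Hypothetical data: clipped Gaussian samples, split into training and validation.
data = [max(-0.999, min(0.999, rng.gauss(0.0, 0.3))) for _ in range(300)]
forest = best_scored_forest(data[:200], data[200:], m=5, k=10, p=8, rng=rng)
```

Only the choice among candidate partitions uses the data; each candidate partition is still generated without looking at the samples.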
3 A Generic Clustering Algorithm
In this section, we present a generic clustering algorithm, where the clusters are estimated with the help of a generic level set estimator which can be specified later by histogram, kernel, or random forest density estimators. To this end, let the optimal level and the resulting clusters , for distributions be as in Definition 2.2, and the constant be as in Assumption 2.2. The goal of this section is to investigate whether or can be estimated and whether , can be clustered.
Let us first recall some more notations introduced in Section 2. For a -absolutely continuous distribution , let the level , the level set , , and the function be as in Definition 2.2. Furthermore, for a fixed set , its -tubes and are defined by (2.2). Moreover, concerning with the thick level sets, the constant and the function are introduced by Definition 2.3.
In what follows, let always be a decreasing family of sets such that
[TABLE]
holds for all .
The following theorem relates the component structure of a family of level set estimators , which is a decreasing family of subsets of , to the component structure of certain sets ; for more details see, e.g., Steinwart (2015a).
**Theorem 3.1.**
Let Assumption 2.1 hold. Furthermore, for , let , , , and be as in (3.1). Then, for all and the corresponding CRMs , the following disjoint union holds:
[TABLE]
From Theorem 3.1 we see that for suitable , , and , all -connected components of are either contained in , or vanish at level . Accordingly, carrying out these steps precisely, we obtain a generic clustering strategy shown in Algorithm 1.
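Since Algorithm 1 itself is not reproduced in the text above, the following sketch only loosely follows the generic procedure of Steinwart (2015a) as we understand it: the level is raised in steps of size `eps` until the number of components that survive two steps higher differs from one, then raised by a further `2 * eps` before the clusters are read off. The callable `level_set`, the one-dimensional point sets and all function names are assumptions made for the illustration.

```python
def tau_components(points, tau):
    """tau-connected components of a finite 1-D point set (consecutive gaps < tau)."""
    components, current = [], []
    for x in sorted(points):
        if current and x - current[-1] >= tau:
            components.append(current)
            current = []
        current.append(x)
    if current:
        components.append(current)
    return components


def surviving_components(level_set, rho, eps, tau):
    """Components of the level set at rho that still meet the level set at rho + 2 eps."""
    higher = set(level_set(rho + 2 * eps))
    return [c for c in tau_components(level_set(rho), tau) if higher & set(c)]


def generic_clustering(level_set, eps, tau, rho_start=0.0):
    """Return an estimated split level and clusters, or the start level and one cluster."""
    rho = rho_start
    while True:
        found = surviving_components(level_set, rho, eps, tau)
        rho += eps
        if len(found) != 1:
            break
    rho += 2 * eps
    found = surviving_components(level_set, rho, eps, tau)
    if len(found) > 1:
        return rho, found
    return rho_start, [sorted(level_set(rho_start))]


# Hypothetical estimated density on integer positions: two bumps and a valley.
heights = {0: 0.25, 1: 0.65, 2: 0.95, 3: 0.65, 4: 0.25,
           5: 0.65, 6: 0.95, 7: 0.65, 8: 0.25}


def level_set(rho):
    """Decreasing family of level set estimates: positions with height >= rho."""
    return [x for x, h in heights.items() if h >= rho]


rho_out, clusters = generic_clustering(level_set, eps=0.1, tau=1.5)
```

In this toy example the loop stops once the valley drops out of the level set, and the two bumps are returned as separate clusters at a level slightly above the split.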
Under Assumptions 2.1 and 2.2, the following theorem bounds the level and the components , , and the start level and the corresponding single cluster , respectively, which are outputs returned by Algorithm 1.
**Theorem 3.2.**
(i) Let Assumption 2.1 hold. For , let , , , and satisfy (3.1) for all . Then, for any data set , the following statements hold for Algorithm 1:
- (a)
The returned level satisfies both and
[TABLE]
- (b)
The returned sets , , can be ordered such that
[TABLE]
Here, , , are ordered in the sense of .
(ii) Let Assumption 2.2 hold. Moreover, let , be fixed, , and satisfy (3.1) for all . If , then Algorithm 1 returns the start level and the corresponding single cluster such that
[TABLE]
where .
The above analysis is carried out for the general case, in which the underlying density is assumed to have already been estimated successfully. In the following, we examine the component structure and other properties of the clustering algorithm when the density is estimated by the forest density estimator (2.4).
Note that one more notation is necessary for clear understanding: One way to define level set estimators with the help of the forest density estimator (2.4) is a simple plug-in approach, which is
[TABLE]
However, these level set estimators are too complicated to compute the -connected components in Algorithm 1. Instead, we take level set estimators of the form
[TABLE]
The following theorem shows that some kind of uncertainty control of the form (3.1) is valid for level set estimators of the form (3.3) induced by the forest density estimator (2.4).
**Theorem 3.3.**
Let be a -absolutely continuous distribution on and be the forest density estimator (2.4) with . For any , , that is, is one of the cells in the -th partition, there exists a constant such that . Then, for all and , there holds
[TABLE]
Before we present the next theorem, recall that denotes half of the side length of the centered hypercube in and denotes the number of trees in the best-scored random forest.
**Theorem 3.4.**
Let be a -absolutely continuous distribution on . For , , , , we choose an satisfying
[TABLE]
where is defined by
[TABLE]
Furthermore, for and , we choose a with and assume that this satisfies and . Moreover, for each random density tree, we pick the number of splits satisfying
[TABLE]
If we feed Algorithm 1 with parameters , , , and as in (3.3), then the following statements hold:
(i) If satisfies Assumption 2.1 and there exists an satisfying
[TABLE]
then with probability not less than , the following statements hold:
- (a)
The returned level satisfies both and
[TABLE]
- (b)
The returned sets , , can be ordered such that
[TABLE]
Here, , , are ordered in the sense of .
(ii) If satisfies Assumption 2.2 and , then
[TABLE]
holds with probability not less than for the returned level and the corresponding single cluster , where .
4 Main Results
In this section, we present main theoretical results of our best-scored clustering forest on the consistency as well as convergence rates for both the optimal level and the true clusters , , simultaneously using the error bounds derived in Theorem 3.2 and Theorem 3.4, respectively. We also present some comments and discussions on the obtained theoretical results.
4.1 Consistency for Best-scored Clustering Forest
**Theorem 4.1 (Consistency).**
Let Assumption 2.1 hold. Furthermore, for certain constant , assume that , , , and are strictly positive sequences converging to zero satisfying for sufficiently large , , . Moreover, let the number of splits satisfy
[TABLE]
where and . If we feed Algorithm 1 with parameters , , as in (3.3), and , then the following statements hold:
- (i)
If satisfies Assumption 2.1, then for all , the returned level satisfies
[TABLE]
Moreover, if , then for all , the returned sets , , satisfy
[TABLE]
- (ii)
If satisfies Assumption 2.2 and , then for all , the returned level satisfies
[TABLE]
Moreover, if , then for all , the returned set satisfies
[TABLE]
4.2 Convergence Rates for Best-scored Clustering Forest
In this subsection, we derive the convergence rates for both estimation problems, that is, for estimating the optimal level and the true clusters , , in our proposed algorithm separately.
4.2.1 Convergence Rates for Estimating the Optimal Level
In order to derive the convergence rates for estimating the optimal level , we need the following definition, which describes how well the clusters are separated above .
**Definition 4.1.**
Let Assumption 2.1 hold. The clusters of are said to have separation exponent if there exists a constant such that
[TABLE]
holds for all . Moreover, the separation exponent is called exact if there exists another constant such that
[TABLE]
holds for all .
The separation exponent $\kappa$ describes how fast the connected components of the $M_\rho$ approach each other for $\rho \searrow \rho^*$, and a distribution having separation exponent $\kappa$ also has separation exponent $\kappa'$ for all $\kappa' < \kappa$. If the separation exponent is $\kappa = \infty$, then the clusters $B_1^*$ and $B_2^*$ do not touch each other. With Definition 4.1, we are able to establish error bounds for estimating the optimal level in the following theorem, whose proof is quite similar to that of Theorem 4.3 in Steinwart (2015a) and is hence omitted.
**Theorem 4.2.**
Let Assumption 2.1 hold, and assume that has a bounded -density whose clusters have separation exponent . For , , , , we choose an satisfying
[TABLE]
with as in (3.6). Furthermore, for and , we choose a with and assume that this satisfies and . Moreover, for each random density tree, we pick the number of splits satisfying
[TABLE]
Finally, suppose that satisfies . If we feed Algorithm 1 with parameters , , , as in (3.3), and , then the returned level satisfies
[TABLE]
with probability not less than . Moreover, if the separation exponent is exact and , then we have
[TABLE]
**Corollary 4.3 (Convergence Rates for Estimating the Optimal Level).**
Let Assumption 2.1 hold and suppose that is -Hölder continuous with exponent whose clusters have separation exponent . For any , and all , let , , , and be sequences with
[TABLE]
where , and . Moreover, we choose the number of splits as
[TABLE]
If we feed Algorithm 1 with parameters , , , , and , then for all sufficiently large $n$, there exists a constant such that the returned level satisfies
[TABLE]
Moreover, if the separation exponent is exact, there exists another constant such that for all sufficiently large , there holds
[TABLE]
4.2.2 Convergence Rates for Estimating the True Clusters
Our next goal is to establish learning rates for the true clusters, in other words, to describe how fast $\mu(B_i(D) \triangle B_i^*)$ goes to $0$. Since this is a modified level set estimation problem, we need to make some further assumptions on $P$. The first definition can be considered as a one-sided variant of a well-known condition introduced by Polonik (1995, Theorem 3.6).
**Definition 4.2.**
Let be a measure on and be a distribution on that has a -density . For a given level , we say that has flatness exponent if there exists a constant such that for all , we have
[TABLE]
It can be easily observed from (4.3) that the larger is, the steeper approaches from above. Particularly, in the case of , the density is allowed to take the value , otherwise it would be bounded away from .
The next definition describes the roughness of the boundary of the clusters, see also Definition 4.6 in Steinwart (2015a).
Definition 4.3.
Let Assumption 2.1 hold. Given some , we say that the clusters have an -smooth boundary if there exists a constant such that for all and , there holds
[TABLE]
where , denote the connected components of the level set .
Note that considering does not make sense in and if has rectifiable boundary, we always have , see Lemma A.10.4 in Steinwart (2015b).
Now, we summarize all the conditions on needed to obtain learning rates for cluster estimation.
Assumption 4.1.
Let Assumption 2.1 hold. Moreover, assume that has a bounded -density and a flatness exponent at level , whose clusters have an -smooth boundary for some and a separation exponent .
The following theorem provides a finite sample bound that can be later used to describe how well our algorithm estimates the true clusters , , see also Theorem 4.7 in Steinwart (2015a).
Theorem 4.4.
Let Assumption 4.1 hold. Furthermore, let be defined as in Theorem 4.2. For , , , , if we feed Algorithm 1 with parameters , , , , and as in Theorem 4.2, then the returned level satisfies inequalities (4.1) and the returned sets , satisfy
[TABLE]
with probability not less than .
Note that if the separation exponent is exact and finite, then the inequality (4.2) also holds for the returned level . Moreover, if and are of finite values, then the bound in Theorem 4.4 behaves like
[TABLE]
and the convergence rates are presented in the following corollary.
Corollary 4.5 (Convergence Rates for Estimating the True Clusters).
Let Assumption 4.1 hold. Furthermore, for , let , , , , and be sequences with
[TABLE]
where , , , and . If we feed Algorithm 1 with parameters , , , as in (3.3), and , then there exists a constant such that the returned sets , , satisfy
[TABLE]
4.3 Comments and Discussions
This subsection presents some comments and discussions on the established learning rates for estimating the optimal level and the true clusters , .
First of all, let us compare our convergence rates for estimating the optimal level with existing convergence rates in the literature. Corollary 4.3 tells us that for any , our learning rate is of the form
[TABLE]
where , . In contrast, Steinwart (2015a) has shown that the clustering algorithm using histogram density estimator learns with the rate
[TABLE]
Simple algebraic calculations show that if is sufficiently small and , then this rate will be slower than ours. However, if the best separation exponent , that is, the clusters and do not touch each other, then our learning rate becomes
[TABLE]
which turns out to be slower than the rate established in Steinwart (2015a).
On the other hand, concerning the learning rates for estimating the true clusters, Corollary 4.5 shows that our algorithm learns with rate
[TABLE]
where and . Obviously, this rate is strictly slower than the rate derived by Steinwart (2015a). Nevertheless, in the case of , it can be easily shown that our rate is faster than the rate established in Sriperumbudur and Steinwart (2012).
Note that if Assumption 4.1 holds with and , then the convergence rates for estimating and the clusters can be achieved simultaneously. In contrast, in the case of , the estimation of is easier than the estimation of the level set ; a more detailed discussion can be found in Steinwart (2015a).
Finally, we mention that in general, our convergence rates can be slower than those of other clustering algorithms due to the nature of random partition, which in turn leads to diversity and thus accuracy of our clustering algorithm.
5 Experimental Performance
In this section, we first summarize the proposed best-scored clustering forest algorithm in Subsection 5.1, and discuss the model selection problem of various clustering algorithms in Subsection 5.2. Then we compare our clustering algorithm with other proposals on synthetic data in Subsection 5.3 and on real data sets in Subsection 5.4.
5.1 Algorithm Construction
Our proposed best-scored clustering forest algorithm is presented in detail in Algorithm 2. To measure the similarity between two data clusterings, we adopt the adjusted Rand index (ARI) throughout all experiments, which can be formulated as follows: Given a set of n elements and two clusterings of these elements, namely and , the overlap between and can be summarized with which stands for the number of objects in set , , and . Then the adjusted Rand index is defined as
[TABLE]
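As an illustration of the definition above, the following minimal sketch computes the ARI from the contingency counts of two labelings, using the standard pair-counting form of the index. The function name `adjusted_rand_index` and the convention of returning 1 in the degenerate case are assumptions made for this example; they are not taken from the paper.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_a, labels_b):
    """ARI between two labelings of the same n elements (illustrative sketch)."""
    if len(labels_a) != len(labels_b):
        raise ValueError("both labelings must cover the same elements")
    n = len(labels_a)
    if n < 2:
        return 1.0  # assumed convention: no pairs to disagree on

    # Contingency counts: elements shared by cluster i of the first
    # clustering and cluster j of the second, plus row and column sums.
    n_ij = Counter(zip(labels_a, labels_b))
    a_i = Counter(labels_a)
    b_j = Counter(labels_b)

    sum_ij = sum(comb(c, 2) for c in n_ij.values())
    sum_a = sum(comb(c, 2) for c in a_i.values())
    sum_b = sum(comb(c, 2) for c in b_j.values())

    expected = sum_a * sum_b / comb(n, 2)  # expected index under random labeling
    maximum = (sum_a + sum_b) / 2          # largest attainable index
    if maximum == expected:
        # Degenerate case (e.g. both clusterings consist of one cluster);
        # returning 1.0 here is an assumed convention.
        return 1.0
    return (sum_ij - expected) / (maximum - expected)
```

The index equals 1 when the two clusterings agree up to a relabeling, which is how the ARI values of 1 in Table 1 should be read.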
Finally, it is worth mentioning that, in order to improve the efficiency and accuracy of our clustering algorithm, we also employ the adaptive splitting method (see Figure 7), which was proposed for the density estimation problem. For more details, we refer to Section 5.1 in Hang and Wen (2018).
5.2 Experimental Setup
In our experiments, we compare the clusters produced by the different approaches with the true classes by computing the adjusted Rand index (ARI) as the performance measure. We conduct comparisons among several baseline clustering methods, including Fast Clustering Using Adaptive Density Peak Detection (ADP-Cluster), Density-Based Spatial Clustering of Applications with Noise (DBSCAN), k-means and PDF-Cluster.
- ADP-Cluster: The algorithm builds and improves upon the idea of Xu and Wang (2016) by finding density peaks in a density-distance plot generated from local multivariate Gaussian density estimation. There are two leading parameters: the bandwidths of the multivariate kernel density and the number of clusters, which is determined automatically by a validation criterion.
- DBSCAN: The algorithm can be traced back to Ester et al. (1996). It is also a non-parametric density-based clustering algorithm, which groups points that are closely packed together. The algorithm requires two parameters: the neighborhood radius $\varepsilon$ and the minimum number of points minPts required to form a dense region.
- k-means: The only parameter in k-means is the number of clusters k. The idea goes back to Macqueen (1967) and is popular for cluster analysis in data mining. It is important to run diagnostic checks for determining the number of clusters in the data set.
- PDF-Cluster: The leading parameters in the algorithm are the bandwidth of the kernel density estimate, selected by least-squares cross-validation, and the tolerance threshold used to set edges between two observations. The idea was proposed by Menardi and Azzalini (2014), who developed a viable solution to the problem of finding connected sets in higher-dimensional spaces.
Note that the best-scored clustering forest algorithm has more free parameters than the other methods. To be specific, these free parameters include the number of density trees in the forest , the ratio of number of splits for trees in the forest to the sample size , the positive number for selecting low-density points as background points, the positive integer to allocate background points to clusters with k-NN classification as well as the number of clusters .
For DBSCAN, the parameter is picked from to by , minPts is default and is picked from . For k-means, the parameter is selected from . For PDF-Cluster, the parameter is selected from 0.01 to 0.51 by 0.01. For our method, the parameter is set to be 100, the ratio is selected from and is selected from -quantile of the pairwise distances , where is chosen from , , , , , , , , the parameter of k-NN is selected from and the number of clusters is selected from . It is worth pointing out that both DBSCAN and our method assign only a fraction of the points to clusters (the foreground points), while leaving low-density observations (background points) unlabeled. The background points can then be assigned to clusters with the k-NN algorithm. In our experiments, for algorithms with deterministic results, the performance is reported with the best parameter setting, while for algorithms with stochastic results, the experiment is repeated 10 times and the average performance is reported with the best parameter setting.
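The assignment of background points described above can be sketched as a plain k-NN majority vote over the labeled foreground points. This is only an illustrative sketch: the function name `assign_background`, the Euclidean distance, and the tie-breaking rule are assumptions of this example and are not specified in the paper.

```python
from collections import Counter
from math import dist


def assign_background(foreground, fg_labels, background, k=3):
    """Label each background point by a majority vote among its k nearest
    foreground points (Euclidean distance; hypothetical helper)."""
    assigned = []
    for point in background:
        # Indices of the k foreground points closest to this background point.
        nearest = sorted(
            range(len(foreground)),
            key=lambda i: dist(point, foreground[i]),
        )[:k]
        votes = Counter(fg_labels[i] for i in nearest)
        # On a tie, Counter keeps insertion order, so the label of the
        # closest neighbor among the tied labels wins.
        assigned.append(votes.most_common(1)[0][0])
    return assigned


# Two well-separated foreground clusters and two unlabeled background points.
foreground = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
fg_labels = [0, 0, 0, 1, 1, 1]
background = [(0.5, 0.5), (9, 9)]
print(assign_background(foreground, fg_labels, background, k=3))
```

Once every background point has a label, the ARI can be computed on the full sample rather than on the foreground points only.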
We use the Python package scikit-learn for DBSCAN and k-means, and R packages for ADP-Cluster and PDF-Cluster.
5.3 Synthetic Data
In this subsection, we apply the density-based clustering methods mentioned above on four artificial examples. To be specific, we simulate four two-dimensional toy datasets with different shapes of clusters:
- noisy circles: consists of a large circle enclosing a smaller circle, with two-dimensional noise;
- varied blob: is generated by isotropic Gaussian blobs with different variances;
- noisy moons: is made up of two interleaving half circles with added Gaussian noise;
- aniso-blob: is anisotropically distributed, i.e., the data set is generated by anisotropic Gaussian blobs.
In order to assess the scalability of these algorithms, we choose a sample size () that is large enough to be informative, but not so large that the running time becomes excessive.
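To make the shapes of two of these toy data sets concrete, the sketch below generates noisy circles and noisy moons with the Python standard library only. The generator name `two_shapes`, the inner-circle scale `factor`, the noise level, and the moon offsets are all assumptions chosen for illustration; the paper does not state how its data were simulated or which settings were used.

```python
import math
import random


def two_shapes(n, shape="circles", factor=0.5, noise=0.05, seed=0):
    """Return (points, labels) for a two-cluster toy data set (hypothetical helper).

    shape="circles": a unit circle (label 0) enclosing a circle of radius
    `factor` (label 1). shape="moons": two interleaving half circles.
    Gaussian noise with standard deviation `noise` is added to each coordinate.
    """
    rng = random.Random(seed)
    points, labels = [], []
    for i in range(n):
        label = i % 2  # alternate so that both clusters have similar sizes
        if shape == "circles":
            angle = rng.uniform(0.0, 2.0 * math.pi)
            radius = 1.0 if label == 0 else factor
            x, y = radius * math.cos(angle), radius * math.sin(angle)
        elif shape == "moons":
            angle = rng.uniform(0.0, math.pi)
            if label == 0:
                x, y = math.cos(angle), math.sin(angle)  # upper half circle
            else:
                x, y = 1.0 - math.cos(angle), 0.5 - math.sin(angle)  # shifted lower half
        else:
            raise ValueError("shape must be 'circles' or 'moons'")
        points.append((x + rng.gauss(0.0, noise), y + rng.gauss(0.0, noise)))
        labels.append(label)
    return points, labels
```

Clusters of this kind are not linearly separable and not convex, which is why centroid-based methods are expected to struggle on them while density-based methods can recover them.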
Table 1 reports the ARI of our clustering algorithm and of the other clustering methods, each with its best parameter setting, on the four toy data sets. Our algorithm and DBSCAN both recover the correct clusters (ARI of 1) on three data sets: aniso-blob, noisy circles, and noisy moons. On varied blob, our algorithm is a close second to PDF-Cluster (0.936 versus 0.939) and ahead of the remaining methods, which further demonstrates the effectiveness of the algorithm.
5.4 Real Data Analysis
To assess the performance of the various clustering methods, we evaluate the ARI of ADP-Cluster, DBSCAN, k-means, PDF-Cluster and our best-scored clustering forest on the following real data sets from UCI and Kaggle:
- Appendicitis: The appendicitis data, collected in the medical field, were first put forward in Weiss and Kulikowski (1991). The data represent medical measures taken over patients, where the class label indicates whether the patient has appendicitis (class label ) or not (class label [math]).
- Customers: The data set refers to clients of a wholesale distributor and includes the annual spending in monetary units on diverse product categories. This database, available on UCI, contains observations of dimension representing attributes such as fresh, milk, grocery, frozen, etc.
- Flea-beetles: The data set covers three species of flea-beetles (concinna, heptapotamica, and heikertingeri) and was collected with six measurements: tars1, tars2, head, aede1, aede2 and aede3. The whole data set consists of samples.
- Iris: Regarded as one of the best-known databases in the pattern recognition literature, Iris contains classes of instances each, where each class refers to a type of iris plant. The learning goal is to group the iris data, described by four features (sepal length, sepal width, petal length, and petal width), into the true classes.
- Oliveoil: Oliveoil comprises observations from oil analysis using measurements of different specimens of olive oil produced in various regions in Italy, which can be further divided into three macro-areas: Centre-North, South, and Sardinia. The -dimensional input data represent attributes such as palmitic, palmitoleic, stearic, oleic, linoleic, linolenic, arachidic, and eicosenoic. The learning task is to reconstruct the macro-area membership.
- Wifi-localization: The database, comprising observations, was collected in an indoor space by observing the signal strengths of seven WiFi signals visible on a smartphone. The experiment was performed to explore how WiFi signal strengths can be used to determine one of the indoor locations.
- Wine: This data set, including observations, is the result of a chemical analysis of wines grown in the same region in Italy but derived from three different cultivars. The analysis determined the quantities of constituents found in each of the three types of wines.
Table 2 summarizes the ARI on the real data sets mentioned above. For most of these data sets, the best-scored clustering forest has a significantly larger ARI than the other clustering methods. This superiority in clustering accuracy may be attributed both to the density estimation accuracy resulting from the general architecture of random forests and to the ability of density-based clustering to group the data into arbitrarily shaped clusters. Interested readers can further tune the free parameters, and we believe that more accurate results could be obtained.
6 Proofs
Proof of Theorem 3.3.
(i) Let us first prove the inclusion . To this end, we fix an , then we have , that is, for all , we have . In other words, if satisfying , then we have .
Now we show that for all , we have
[TABLE]
whose proof will be conducted in the following by contradiction. Suppose that there exists a sample with and . If we denote as the unique cell of the partition of the -th tree in the forest where falls, then the assumption implies that for any , there holds
[TABLE]
and consequently we have , i.e., for . This together with the normality of yields
[TABLE]
which leads to
[TABLE]
Consequently, we have
[TABLE]
By and , we find , which contradicts (6.2). Therefore, for all , we have or .
Next, we show by contradiction that there exists a sample such that . If we denote as the unique cell of the partition of the -th tree in the forest where falls, then for all , , we have , and consequently , . This leads to , which contradicts the condition . Therefore, we conclude that there exists a sample satisfying . This together with (6.1) implies , which means . This finishes the proof of .
(ii) To prove the second inclusion , let us fix an , then there exists satisfying and . Moreover, since , we have
[TABLE]
Now we are able to prove the inclusion by contradiction. Suppose that . Since , then we have . If stands for the unique cell of the partition of the -th tree in the forest where falls, since , we thus have . This together with the normality of yields
[TABLE]
which leads to
[TABLE]
Consequently we have
[TABLE]
which contradicts (6.3). Therefore, we conclude that . This completes the proof of . ∎
Proof of Theorem 3.4.
The proof can be conducted by applying Theorem 3.2 directly and hence we need to verify its assumptions.
Let us first prove that if , , and , then we have . To this end, we define a set by
[TABLE]
Obviously, we have , since . This implies that there exists an such that . Using the monotonicity of , we conclude that .
Next, we prove that for all , satisfy (3.1) with probability not less than . For , let the events and be defined by
[TABLE]
According to Proposition 15 and Inequality (19) in Hang and Wen (2018), there hold
[TABLE]
for all . Moreover, for the forest, we define the events and by
[TABLE]
Since the splitting criteria are i.i.d. from , then we have
[TABLE]
and consequently we obtain
[TABLE]
This proves that for all , satisfy (3.1) with probability not less than and hence all the assumptions of Theorem 3.2 are indeed satisfied. ∎
To prove Theorem 4.1 concerning the consistency of our clustering algorithm, we need the following technical lemma.
Lemma 6.1.
Let , be strictly positive sequences and be the solution of equation
[TABLE]
If and , then .
Proof of Lemma 6.1.
We prove the lemma by contradiction. To this end, we assume that . Then there exists an , and a subsequence of denoted by such that hold for all . Consequently we obtain
[TABLE]
for all . This together with the condition implies that . Therefore, we have
[TABLE]
which contradicts the condition and thus the assertion is proved. ∎
Proof of Theorem 4.1.
Let the events , , , and be defined as in (6.5), (6.6), (6.7), and (6.8) respectively. According to Inequality (19) in Hang and Wen (2018), we have
[TABLE]
and consequently we obtain
[TABLE]
Since is finite and splitting criteria are i.i.d. from , we have
[TABLE]
Proposition 15 in Hang and Wen (2018) shows that
[TABLE]
where with
[TABLE]
Obviously, there exists certain such that .
Next, with the help of Lemma 6.1, we show that if , then we have with satisfying (6.9). Clearly, we have
[TABLE]
Plugging this into (6.9), we obtain
[TABLE]
Now, by setting
[TABLE]
it can be easily verified that there exist finite constants , , , and such that
[TABLE]
and
[TABLE]
Then, Lemma 6.1 with the above and implies and consequently we have
[TABLE]
Since is finite and splitting criteria are i.i.d. from , we have
[TABLE]
which completes the proof of consistency according to Theorem 3.2 and Section A.9 in Steinwart (2015b). ∎
Proof of Corollary 4.3.
For , we define
[TABLE]
Since sequences , and converge to [math], we have and for all sufficiently large , there holds
[TABLE]
Moreover, the assumed and satisfy
[TABLE]
and therefore we have
[TABLE]
for all sufficiently large . Set
[TABLE]
and denote as in (3.6). Since and , we have
[TABLE]
Consequently, for all sufficiently large , we have
[TABLE]
and therefore condition (3.5) on is satisfied. Moreover, there holds
[TABLE]
and consequently we have
[TABLE]
In other words, for all sufficiently large , there holds
[TABLE]
and therefore condition (3.7) on is satisfied.
Now, by applying Theorem 4.2, there exist an and a constant such that the right-hand side of inequalities (4.1) holds for . Moreover, if is exact, (4.2) holds for all . ∎
Proof of Corollary 4.5.
Similarly to the proof of Theorem 4.2, we prove that for all sufficiently large , there holds
[TABLE]
with as in (3.6), and thus the conditions in Theorem 4.4 are all satisfied. Then, for such , by applying Theorem 4.4, we obtain
[TABLE]
Elementary calculations show that with the assumed ,, and , there exists a constant such that
[TABLE]
Obviously, we have
[TABLE]
Therefore, we can choose a constant large enough such that the desired inequality holds for all . ∎
7 Conclusion
In this paper, we present an algorithm called best-scored clustering forest to efficiently solve the single-level density-based clustering problem. From the theoretical perspective, our main results comprise statements and a complete analysis of statistical properties such as consistency and learning rates. The convergence analysis is conducted within the framework established in Steinwart (2015a). With the help of the best-scored random forest density estimator proposed by Hang and Wen (2018), we show that consistency of our proposed clustering algorithm can be established with properly chosen hyperparameters of the density estimators and partition diameters. Moreover, we obtain fast rates of convergence for estimating the clusters under certain mild conditions on the underlying density functions and target clusters. Last but not least, the performance of best-scored clustering forest was demonstrated by various numerical experiments. On the one hand, the new approach provides a better average adjusted Rand index (ARI) than other state-of-the-art methods such as ADP-Cluster, DBSCAN, k-means and PDF-Cluster on synthetic data, while providing an average ARI that is at least comparable on several benchmark real data sets. On the other hand, due to the intrinsic advantage of random forests, it is to be expected that this strategy enjoys satisfactory computational efficiency by taking full advantage of parallel computing.
References
- Breiman, L. (2000). Some infinite theory for predictor ensembles. University of California at Berkeley Papers.
- Chaudhuri, K. and S. Dasgupta (2010). Rates of convergence for the cluster tree. In Advances in Neural Information Processing Systems.
- Cuevas, A. and R. Fraiman (1997). A plug-in approach to support estimation. Annals of Statistics 25 (6), 2300–2312.
- Defays, D. (1977). An efficient algorithm for a complete link method. Computer Journal 20 (4), 364–366.
- Dempster, A. P., N. M. Laird, and D. B. Rubin (1977). Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society 39 (1), 1–38.
- Donath, W. E. and A. J. Hoffman (1973). Lower bounds for the partitioning of graphs. IBM Journal of Research and Development 17 (5), 420–425.
- Ester, M., H. P. Kriegel, and X. Xu (1996). A density-based algorithm for discovering clusters in large spatial databases with noise. In International Conference on Knowledge Discovery and Data Mining.
- Filipovych, R., S. M. Resnick, and C. Davatzikos (2011). Semi-supervised cluster analysis of imaging data. NeuroImage 54 (3), 2185–2197.
