Reliable Agglomerative Clustering
Morteza Haghir Chehreghani

TL;DR
This paper proposes a new adaptive agglomerative clustering strategy that extracts all reliable linkages at each step, improving flexibility and density consistency, and demonstrates its effectiveness through experiments.
Contribution
It introduces a novel strategy for agglomerative clustering that considers all reliable linkages, extending standard methods and connecting to minimum spanning tree algorithms.
Findings
The new strategy improves clustering performance on real-world datasets.
It generalizes standard agglomerative clustering by extracting multiple linkages.
The approach is applicable with common linkage criteria, including single linkage.
Abstract
Standard agglomerative clustering establishes a single new reliable linkage at every step. To provide adaptive, density-consistent and flexible solutions, we study extracting all the reliable linkages at each step, instead of only the smallest one. Such a strategy can be applied with all common criteria for agglomerative hierarchical clustering. We also show that this strategy with the single linkage criterion yields a minimum spanning tree algorithm. We perform experiments on several real-world datasets to demonstrate the performance of this strategy compared to the standard alternative.
| dataset | single (stnd) | single (rlbl) | complete (stnd) | complete (rlbl) | average (stnd) | average (rlbl) | centroid (stnd) | centroid (rlbl) | Ward (stnd) | Ward (rlbl) |
|---|---|---|---|---|---|---|---|---|---|---|
| Ecoli | 0.0564 | 0.0564 | 0.6235 | 0.6235 | 0.5907 | 0.6812 | 0.0462 | 0.0383 | 0.5473 | 0.5445 |
| Hayes Roth | 0.0161 | 0.2336 | 0.0354 | 0.2338 | 0.1629 | 0.2338 | 0.0000 | 0.0030 | 0.0249 | 0.0808 |
| Iris | 0.5821 | 0.5821 | 0.6963 | 0.6963 | 0.6301 | 0.6301 | 0.7934 | 0.7934 | 0.7578 | 0.7578 |
| Lung Cancer | 0.0149 | 0.0149 | 0.1537 | 0.2070 | 0.0239 | 0.1413 | 0.0000 | 0.0000 | 0.1766 | 0.1684 |
| Perfume | 0.7024 | 0.7024 | 0.7332 | 0.7332 | 0.7601 | 0.7595 | 0.7544 | 0.7664 | 0.8246 | 0.8246 |
| Seeds | 0.0283 | 0.0283 | 0.6029 | 0.6029 | 0.6083 | 0.7055 | 0.6034 | 0.6140 | 0.7243 | 0.7243 |
| Wine | 0.0237 | 0.0237 | 0.4307 | 0.4307 | 0.3223 | 0.3452 | 0.3251 | 0.3251 | 0.4097 | 0.4097 |
| COMP | 0.0604 | 0.0604 | 0.1459 | 0.1459 | 0.0453 | 0.1611 | 0.0312 | 0.0341 | 0.1021 | 0.1140 |
| REC | 0.0228 | 0.0402 | 0.1793 | 0.1793 | 0.0330 | 0.2375 | 0.0161 | 0.0315 | 0.2574 | 0.2574 |
| SCI | 0.0617 | 0.0617 | 0.0823 | 0.0823 | 0.0387 | 0.1557 | 0.0339 | 0.0651 | 0.1997 | 0.3042 |
| Real I | 0.5782 | 0.5782 | 0.7114 | 0.7114 | 0.7813 | 0.8237 | 0.0785 | 0.0670 | 0.5976 | 0.7546 |
| Real II | 0.5711 | 0.5711 | 0.7430 | 0.7430 | 0.7704 | 0.8130 | 0.0458 | 0.0268 | 0.6542 | 0.8274 |
| Real III | 0.5389 | 0.5389 | 0.7581 | 0.7581 | 0.7209 | 0.7733 | 0.0132 | 0.0145 | 0.7156 | 0.8697 |

| dataset | single (stnd) | single (rlbl) | complete (stnd) | complete (rlbl) | average (stnd) | average (rlbl) | centroid (stnd) | centroid (rlbl) | Ward (stnd) | Ward (rlbl) |
|---|---|---|---|---|---|---|---|---|---|---|
| Ecoli | 0.0386 | 0.0386 | 0.6908 | 0.6908 | 0.6974 | 0.7509 | 0.0297 | 0.0252 | 0.4686 | 0.3914 |
| Hayes Roth | 0.0185 | 0.2086 | 0.0327 | 0.2451 | 0.1620 | 0.2451 | 0.0000 | 0.0058 | 0.0496 | 0.1073 |
| Iris | 0.5638 | 0.5638 | 0.6423 | 0.6423 | 0.5659 | 0.5659 | 0.7592 | 0.7592 | 0.7312 | 0.7312 |
| Lung Cancer | 0.0371 | 0.0371 | 0.2809 | 0.3533 | 0.1327 | 0.1170 | 0.0000 | 0.0000 | 0.3388 | 0.1698 |
| Perfume | 0.4667 | 0.4667 | 0.5096 | 0.5096 | 0.5651 | 0.5600 | 0.5600 | 0.5749 | 0.6590 | 0.6590 |
| Seeds | 0.0025 | 0.0025 | 0.5461 | 0.5461 | 0.5543 | 0.7320 | 0.5664 | 0.5626 | 0.7132 | 0.7132 |
| Wine | 0.0054 | 0.0054 | 0.3708 | 0.3708 | 0.2926 | 0.3204 | 0.3266 | 0.3266 | 0.3684 | 0.3684 |
| COMP | 0.0531 | 0.0531 | 0.1331 | 0.1331 | 0.0040 | 0.1459 | 0.0119 | 0.0138 | 0.0290 | 0.0296 |
| REC | 0.0262 | 0.0742 | 0.0905 | 0.0905 | 0.0025 | 0.2266 | 0.0014 | 0.0052 | 0.2162 | 0.2162 |
| SCI | 0.0884 | 0.0884 | 0.0108 | 0.0108 | 0.0034 | 0.0782 | 0.0493 | 0.0588 | 0.0908 | 0.1688 |
| Real I | 0.4296 | 0.4296 | 0.4133 | 0.4133 | 0.4687 | 0.5699 | 0.0401 | 0.0403 | 0.2649 | 0.4969 |
| Real II | 0.4409 | 0.4409 | 0.4142 | 0.4142 | 0.5581 | 0.6685 | 0.0283 | 0.0198 | 0.3193 | 0.6176 |
| Real III | 0.4235 | 0.4235 | 0.4414 | 0.4414 | 0.6850 | 0.6443 | 0.0123 | 0.0151 | 0.4101 | 0.7042 |

| dataset | single (stnd) | single (rlbl) | complete (stnd) | complete (rlbl) | average (stnd) | average (rlbl) | centroid (stnd) | centroid (rlbl) | Ward (stnd) | Ward (rlbl) |
|---|---|---|---|---|---|---|---|---|---|---|
| Ecoli | 0.1355 | 0.1355 | 0.6789 | 0.6789 | 0.6683 | 0.7115 | 0.1008 | 0.0819 | 0.6123 | 0.5658 |
| Hayes Roth | 0.0579 | 0.3472 | 0.0556 | 0.3010 | 0.2164 | 0.3010 | 0.0000 | 0.0203 | 0.0412 | 0.0995 |
| Iris | 0.7175 | 0.7175 | 0.7221 | 0.7221 | 0.7046 | 0.7046 | 0.8057 | 0.8057 | 0.7701 | 0.7701 |
| Lung Cancer | 0.0287 | 0.0287 | 0.1810 | 0.2303 | 0.0742 | 0.1743 | 0.0000 | 0.0000 | 0.2140 | 0.2030 |
| Perfume | 0.8117 | 0.8117 | 0.8251 | 0.8251 | 0.8437 | 0.8417 | 0.8380 | 0.8442 | 0.8796 | 0.8796 |
| Seeds | 0.0663 | 0.0663 | 0.6152 | 0.6152 | 0.6204 | 0.7094 | 0.6150 | 0.6260 | 0.7309 | 0.7309 |
| Wine | 0.0615 | 0.0615 | 0.4423 | 0.4423 | 0.4049 | 0.3920 | 0.4277 | 0.4277 | 0.4161 | 0.4161 |
| COMP | 0.0351 | 0.0351 | 0.1857 | 0.1857 | 0.0754 | 0.1922 | 0.0515 | 0.0558 | 0.1323 | 0.1468 |
| REC | 0.0307 | 0.0614 | 0.2310 | 0.2310 | 0.0609 | 0.2737 | 0.0308 | 0.0569 | 0.3124 | 0.3124 |
| SCI | 0.0518 | 0.0518 | 0.1337 | 0.1337 | 0.0714 | 0.2005 | 0.0270 | 0.0339 | 0.2546 | 0.3407 |
| Real I | 0.7708 | 0.7708 | 0.8484 | 0.8484 | 0.8409 | 0.8714 | 0.2181 | 0.2016 | 0.7932 | 0.8421 |
| Real II | 0.7570 | 0.7570 | 0.8221 | 0.8221 | 0.8361 | 0.8725 | 0.1384 | 0.0925 | 0.8023 | 0.8614 |
| Real III | 0.7197 | 0.7197 | 0.8155 | 0.8155 | 0.8427 | 0.8408 | 0.0510 | 0.0594 | 0.8238 | 0.8951 |
Reliable Agglomerative Clustering
Morteza Haghir Chehreghani
Department of Computer Science and Engineering
Chalmers University of Technology
Email: [email protected]
1 Introduction
Clustering plays an essential role in data processing and management, with applications such as text processing, image segmentation, compression, summarization, knowledge management, network analysis, and bioinformatics. The goal of data clustering is to partition the data into groups such that the objects in the same cluster are more similar to each other, in some sense, than to the objects in other clusters. One category of clustering methods partitions the data into flat clusters, for example by optimizing a cost/objective function. Examples of this type of method are $k$-means [27], normalized cut [37] and spectral clustering [37, 31], all of which produce flat clusters without any explicit relation between them. In practice, however, different clusters often do not carry the same information content, i.e., some are more detailed than others. Thus, in exploratory data analysis, it is desirable to propose the clusters at different levels and resolutions, such that both general and specific information are preserved. In this way, the user has more control to choose the desired resolution or even to investigate the clusters at different levels and resolutions. For this reason, hierarchical clustering is often more practical in many applications and situations, where the results are usually presented by a dendrogram. A dendrogram is a tree wherein each node represents a cluster and its final nodes (the nodes connected to only one other node) correspond to the objects. A node at a higher level includes the combination of the lower-level clusters, and the edge weights (and their lengths) represent the inter-cluster distances.
Hierarchical clustering methods, in general, fall into two categories: agglomerative (bottom-up) and divisive (top-down) [28]. Agglomerative algorithms consider each object as a separate cluster, and then combine the clusters in a greedy manner to build larger clusters, until at the end there is only one single cluster. Divisive methods, in an opposite way, start with a single cluster including all objects. Then, at each step, the clusters are divided into two parts to produce finer clusters. Agglomerative methods are more common for hierarchical clustering, and they are usually computationally more efficient than divisive methods [32]. In these approaches, the clusters might be combined or divided according to different criteria, e.g., single, complete, average, centroid and Ward.
Several methods have been developed to improve different aspects of these algorithms. [1] studies the locality and outer consistency of agglomerative algorithms in an axiomatic way. The works in [23, 26] consider the statistical significance of hierarchical clustering. [10, 4, 35, 36, 9] investigate the optimization aspects of hierarchical clustering and develop several approximate solutions. To provide robustness in pairwise inter-cluster relations, K-Linkage in [44] investigates multiple pairs of distances for each pair of clusters, [2] uses global information for determining the similarities between the clusters, [6] trains a Bayesian network to infer the relations between the items to be clustered, and [5] suggests applying agglomerative methods to small dense subsets of the data instead of the original data. The work in [7] performs hierarchical clustering on a $k$-nearest neighbor graph, where fixing a proper $k$ (and the other hyper-parameters) can be nontrivial, as discussed in [42]. The works in [22, 14] might suffer from the same issues. The methods in [13, 20] investigate combining agglomerative methods with probabilistic models, which incurs extra computational complexity. Finally, [18, 30, 8] develop efficient (and approximate) implementations of agglomerative methods.
In this paper, we focus on agglomerative hierarchical clustering. We observe that the standard agglomerative algorithms usually select a minimal reliable linkage at each step. We call a linkage between two clusters reliable if both clusters are the nearest neighbors of each other. Linkages represent the inter-cluster distances according to a criterion such as single or average distance. A reliable linkage ensures that the two clusters at its two sides are consistent and share similar properties. However, in order to be adaptive w.r.t. the data diversity and variability, we investigate extracting at each step all the reliable linkages, instead of only the smallest one. This strategy, called reliable agglomerative clustering, enables every object to potentially contribute from the early steps of constructing the dendrogram and, thus, clusters with different shapes and densities can evolve from the beginning. Similar to the standard agglomerative procedure, this strategy can be used with all the common criteria, and it is adaptive to the shape and density of the clusters. A similar idea has been proposed in [3] in an abstract form without further investigation and analysis. We note that this contribution is orthogonal to the aforementioned methods that aim to improve agglomerative clustering, so any of those improvements can be employed with this strategy too. For example, similar to [5], we may build the dendrogram from the dense subsets of the data or use global information for computing the base pairwise (dis)similarities [2]. We may also apply the feature extraction method in [19] to infer proper unsupervised representations. In the following, inspired by the equivalence of single linkage clustering and Kruskal’s algorithm for computing minimum spanning trees [24], we show that reliable agglomerative clustering with the single criterion also yields a minimum spanning tree.
We perform extensive experiments on several real-world datasets to demonstrate the performance of this method compared to the standard approach.
The rest of the paper is organized as follows. In the second section, we introduce the reliable agglomerative strategy, and in the third section, we study the connection to minimum spanning trees. We experimentally investigate reliable agglomerative clustering in the fourth section, and finally, we conclude the paper in the last section.
2 Reliable Agglomerative Clustering
In this section, we describe reliable agglomerative clustering and discuss the connection to computing a minimum spanning tree.
2.1 A generic view of agglomerative clustering
Data are characterized by a set of $n$ objects and a relevant representation. The representation can be, for example, vectors in a vector space or the pairwise dissimilarities between the objects. In the former case, the measurements are given by an $n \times d$ matrix whose $i$-th row corresponds to the $d$-dimensional vector of the $i$-th object. In the latter form, an $n \times n$ matrix represents the pairwise dissimilarities between the objects. A cluster is represented by the set of the object indices that it contains. An inter-cluster distance function $\mathrm{dist}(\cdot,\cdot)$ denotes the distances between clusters, which can be defined according to different criteria.
Agglomerative methods follow an iterative procedure where, at each step, two clusters (nodes) are combined to build a larger cluster. The procedure continues until there is only one cluster left. At each step, the algorithm selects the two clusters that have a minimal distance according to a criterion, i.e., a specific definition of the inter-cluster distance function. For example, the single linkage criterion [38] defines the distance between two clusters as the distance between the nearest members of the clusters. In contrast, the complete linkage criterion [25] defines the distance between two clusters as the distance between their farthest members, which corresponds to the maximum within-cluster distance of the new cluster. In the average criterion [39], the average of the inter-cluster distances is used as the distance between the two clusters. Some other methods, e.g., the centroid and the median criteria, determine a representative for each cluster and then compute the inter-cluster distances as the distances between the representatives. For example, with the centroid criterion the representatives are the means of the clusters and, at each step, the two clusters with the closest centroids are combined to construct a larger cluster.
Another category of agglomerative methods aims to optimize a criterion such as homogeneity. An important instance is the Ward method [43], which aims to minimize the total within-cluster variance at each step. However, this criterion can be written as

$$\mathrm{dist}(p, q) = \frac{|p|\,|q|}{|p| + |q|} \, \lVert \mathbf{m}_p - \mathbf{m}_q \rVert^2,$$

where $p$ and $q$ are two clusters, $|p|$ is the number of objects in cluster $p$, and $\mathbf{m}_p$ denotes the centroid vector of cluster $p$.
Thus, the Ward method also combines, at each step, the two clusters with a minimal distance, where the inter-cluster distances are defined as the distances between the cluster means normalized by a function of the size of the clusters.
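The criteria discussed in this subsection can be sketched as small Python functions. This is only an illustrative sketch: it assumes clusters are given as lists of vectors and that squared Euclidean distance is the base dissimilarity, and the function names are hypothetical rather than taken from the paper.

```python
from itertools import product

def sq_dist(a, b):
    """Squared Euclidean distance between two vectors (assumed base dissimilarity)."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def centroid(points):
    """Mean vector of a list of equal-length vectors."""
    n = len(points)
    return [sum(col) / n for col in zip(*points)]

def single(p, q):
    """Distance between the nearest members of the two clusters."""
    return min(sq_dist(a, b) for a, b in product(p, q))

def complete(p, q):
    """Distance between the farthest members of the two clusters."""
    return max(sq_dist(a, b) for a, b in product(p, q))

def average(p, q):
    """Average of all inter-cluster pairwise distances."""
    return sum(sq_dist(a, b) for a, b in product(p, q)) / (len(p) * len(q))

def centroid_link(p, q):
    """Distance between the cluster means."""
    return sq_dist(centroid(p), centroid(q))

def ward(p, q):
    """Centroid distance scaled by |p||q| / (|p| + |q|)."""
    scale = len(p) * len(q) / (len(p) + len(q))
    return scale * sq_dist(centroid(p), centroid(q))

# Two small illustrative clusters in the plane.
p = [[0.0, 0.0], [0.0, 2.0]]
q = [[3.0, 0.0], [5.0, 0.0]]
print(single(p, q), complete(p, q), average(p, q), centroid_link(p, q), ward(p, q))
```

All five functions share the same signature, so any of them can be passed as the distance function to an agglomerative procedure.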
2.2 Reliable agglomerative clustering strategy
We begin by analyzing the behavior of the single linkage method, in particular on data with diverse densities. Such an analysis can be applied to the other criteria as well. We first consider the data shown in Figure 1, which includes two clusters with different densities. The single linkage method starts from the dense data cloud at the left side (shown by black points) and only then groups the members of the cluster at the right side (shown by green points). Consequently, if we stop the clustering early, only the members of the cluster at the left side will be grouped together. The reason is that picking the smallest inter-cluster distance (linkage) does not necessarily let every object/cluster contribute to building the dendrogram. In particular, as we saw, this approach is sensitive to the density of the clusters and tends to extract the densest clusters first. One way to overcome this issue and take the variance of clusters into account is to require each object/cluster to participate in building the dendrogram. One might interpret the standard agglomerative strategy for selecting the smallest inter-cluster linkage as i) find the nearest neighbors of the current objects/clusters to obtain the set of potential linkages (the nearest neighbors are defined according to the inter-cluster distance function, which can encode any criterion, e.g., single, complete, average, centroid and Ward), and ii) then pick the smallest linkage.
Therefore, one way to let many objects/clusters contribute to building the dendrogram is to choose all the linkages instead of the smallest one, which makes the dendrogram grow simultaneously from all the objects. However, allowing all the linkages corresponding to any nearest neighbor might be inappropriate, as it can be sensitive to the presence of outliers or to clusters which are close but have different densities. Two examples are illustrated in Figure 1. If we pick all of the linkages, the red object at the top in the first example would establish a linkage to the green cluster (with the closest object of it) at the first level of the dendrogram. However, we know that such a linkage should be established at a higher level, after the members of the green data cloud merge and build their own cluster first. Therefore, this linkage is not a reliable linkage, as the two objects at its two sides do not share similar properties and densities. In the second example, the green and black clusters are close to each other, such that some objects of the green data cloud choose members of the black data cloud as their nearest neighbors, instead of choosing from the green data cloud. This occurs due to the different densities of the clusters. Therefore, one should be careful in choosing any nearest neighbor linkage. In these examples, the objects/clusters at the two sides of a linkage have different properties and densities. In the first example, the red object is an outlier whose neighborhood is empty, unlike the neighborhood of the object at the other side, which is significantly denser. Thus, the red object establishes a linkage with one of the green objects, but this object selects another object as its nearest neighbor. In the second example, some of the green objects establish linkages to some of the black objects, which have a different (i.e., higher) density around them. Therefore, the black objects do not select these green objects as their nearest neighbors. This analysis leads us to investigate the reliability of the linkages established by different objects/clusters, defined in Definition 1.
Definition 1. *A linkage between two clusters $p$ and $q$ is ‘reliable’ if and only if both clusters are nearest neighbors of each other, i.e., $q \in \mathrm{nn}(p)$ and $p \in \mathrm{nn}(q)$, where $\mathrm{nn}(p)$ returns the nearest clusters of cluster $p$.* (A cluster may include only one single object, i.e., each object is a cluster at the lowest level of the dendrogram.)
Note that a cluster might have several nearest neighbors, i.e., $|\mathrm{nn}(p)| \ge 1$.
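The reliability test can be illustrated at the lowest level of the dendrogram, where every object is its own cluster. The sketch below assumes a pairwise dissimilarity matrix `D` and uses hypothetical helper names; the one-dimensional points are made up for illustration, with the last point playing the role of an outlier.

```python
def nearest_neighbors(D, i):
    """Indices of all objects at minimal dissimilarity from object i (ties kept)."""
    others = [j for j in range(len(D)) if j != i]
    best = min(D[i][j] for j in others)
    return {j for j in others if D[i][j] == best}

def is_reliable(D, i, j):
    """A linkage is reliable when i and j are nearest neighbors of each other."""
    return j in nearest_neighbors(D, i) and i in nearest_neighbors(D, j)

# Illustrative 1-D data: two tight pairs and one far-away outlier (index 4).
points = [0, 1, 10, 12, 30]
D = [[(a - b) ** 2 for b in points] for a in points]

print(is_reliable(D, 2, 3))   # mutual nearest neighbors
print(is_reliable(D, 4, 3))   # the outlier picks 3, but 3 picks 2
```

The outlier does have a nearest neighbor, but the relation is not mutual, so its linkage is postponed to a higher level.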
Therefore, instead of establishing the linkage(s) from every cluster/object, we select only the subset that is reliable. Such an approach ensures that the clusters at the two sides of a linkage share consistent neighborhoods and densities, so merging them to build a larger cluster becomes meaningful. It also avoids non-robust linkages, for example merging the outlier objects at the lowest levels (Figure 1). Proposition 1 indicates that a linkage with a minimal length is reliable, i.e., the standard agglomerative strategy, which combines only the nearest clusters at each step, performs reliable selections.
Proposition 1.
Given a set of clusters and the respective linkages between them, a linkage with minimal length (called $e^{*}$) is a ‘reliable’ linkage.
Proof sketch.
We denote the clusters at the two sides of $e^{*}$ by $p$ and $q$, respectively. Since $e^{*}$ has a minimal length among all linkages, it is also the smallest linkage connected to $p$, and the same holds for $q$. Therefore, $q$ is the nearest neighbor of $p$ and $p$ is the nearest neighbor of $q$, which makes the corresponding linkage (i.e., $e^{*}$) reliable. ∎
However, a minimal linkage is not the only reliable linkage, in particular when the data contain clusters with diverse densities, as demonstrated in Figure 1. Thus, in order to build the dendrogram in a density-aware and adaptive way, at each level we may select all the linkages that are reliable. This strategy at each step first finds all the reliable linkages, and then combines the respective clusters to build larger clusters at a higher level. Algorithm 1 describes the procedure in detail, providing an implementation of the high-level method in [3].
In this algorithm, a per-level list stores the clusters at the different levels, such that each entry gives the clusters at the corresponding level. A level counter indicates the current level while building the dendrogram. At the beginning, each individual object constitutes a separate cluster at the lowest level. Next, the distance of each cluster at the current level to its nearest neighbor is computed and stored. The inter-cluster distance function computes the distance between the two input clusters according to a predefined criterion, e.g., single, complete and so on. Then, in a graph whose nodes represent the cluster indices at the current level, an edge is established if and only if the two respective clusters are nearest neighbors of each other (i.e., the linkage is reliable). Note that a cluster might have several nearest neighbors, i.e., several clusters might have the same (smallest) distance from it. At the next step, the connected components of the graph are extracted, where each of them represents a new cluster at a higher level. Thus, the clusters in the same connected component are combined to build a new single cluster at the higher level. This procedure (i.e., finding the nearest neighbors and combining them to build new higher-level clusters) continues until only one cluster is left at the highest level.
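The procedure described above can be sketched in plain Python as follows. This is a minimal, unoptimized sketch rather than the paper's Algorithm 1: it assumes a pairwise dissimilarity matrix `D`, uses the single criterion as the default distance function, numbers the lowest level as 0, and all names are hypothetical.

```python
def single_link(D, p, q):
    """Single criterion: smallest dissimilarity between members of p and q."""
    return min(D[i][j] for i in p for j in q)

def reliable_step(D, clusters, link=single_link):
    """Build the next level by merging connected components of the reliable-linkage graph."""
    m = len(clusters)
    # Nearest-neighbor set of every cluster (ties are kept).
    nn = []
    for a in range(m):
        d = {b: link(D, clusters[a], clusters[b]) for b in range(m) if b != a}
        best = min(d.values())
        nn.append({b for b, v in d.items() if v == best})
    # Union-find over the edges between mutual nearest neighbors.
    parent = list(range(m))
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x
    for a in range(m):
        for b in nn[a]:
            if a in nn[b]:              # reliable linkage
                parent[find(a)] = find(b)
    # Each connected component becomes one cluster at the next level.
    groups = {}
    for a in range(m):
        groups.setdefault(find(a), []).extend(clusters[a])
    return [sorted(g) for g in groups.values()]

def reliable_agglomerative(D, link=single_link):
    """Return the list of levels; level 0 holds the singleton clusters."""
    levels = [[[i] for i in range(len(D))]]
    while len(levels[-1]) > 1:
        levels.append(reliable_step(D, levels[-1], link))
    return levels

# Illustrative 1-D data: two tight pairs and one outlier.
points = [0, 1, 10, 12, 30]
D = [[(a - b) ** 2 for b in points] for a in points]
levels = reliable_agglomerative(D)
for lvl, clusters in enumerate(levels):
    print(lvl, clusters)
```

On this toy data both pairs merge at the first level, while the outlier joins only at the last level. The loop terminates because, by Proposition 1, at least one reliable linkage exists at every level.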
Notice that the several improvements developed for the standard strategy can be applied to this strategy as well. The computational complexity of this strategy is similar to that of the standard variant. For $n$ objects, both strategies establish $n-1$ linkages in total. To do so, they compute the inter-cluster distances (linkages) according to a criterion fixed a priori, and for each selected linkage, they merge the respective clusters and update the new inter-cluster linkages. Therefore, the operations and the computations are similar, whereas the choice of specific linkages and/or their order might differ, which can lead to different dendrograms. Selecting all reliable linkages, instead of the smallest one, may reduce the overall number of steps, but it might need more merges at each step. However, as mentioned, the total number of merges is the same for both strategies.
On the other hand, as mentioned, an important computational advantage of the reliable strategy is the possibility of early stopping. It builds and develops several clusters simultaneously, whereas the standard approach develops fewer clusters at the same time. Thus, if we stop at early/intermediate steps, it is more likely that we obtain good representatives of different clusters. But, with the standard strategy, it could happen that only a few clusters are developed and the rest have not even been started yet. This might happen in particular when the clusters have diverse densities and shapes. Therefore, early-stopping, to reduce the computational time, can be more effective with reliable agglomerative clustering. For example, the early clusters can be exposed to the user to select only the interesting and relevant ones to develop further.
Algorithm 1 enables every object to potentially participate in building the dendrogram from the beginning, depending on whether it has a reliable linkage. In other words, establishing and selecting a linkage, and therefore the growth of a cluster, depends only on the relation of an object/cluster to its neighbors, independent of the relations of the other objects/clusters with each other. This is not the case for standard agglomerative clustering. Thus, if we stop the algorithm early, we will possibly have representatives of many clusters, which correspond to the denser and more important (informative) parts. On the other hand, the outlier objects do not occur in the nearest neighborhood of many other clusters or objects. Thus, they join the other parts of the dendrogram only at the higher levels. Thereby, Algorithm 1 can be employed to provide a systematic way to separate structure from noise and outlier objects at different resolutions. The probability of object $i$ being an outlier is proportional to the level at which the object joins the other objects/clusters, i.e.,

$$P(i \text{ is an outlier}) \propto \ell(i),$$

where $\ell(i)$ specifies the level at which object $i$ joins one of the other clusters/objects for the first time. The higher $\ell(i)$ is, the larger the outlier probability is. We postpone the details to future work.
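Since the details are left to future work, the following is only a hypothetical sketch of how such a score could be derived from the levels of a dendrogram. It assumes levels are numbered from 0 (singletons) and that the scores are normalized to sum to one; both choices are assumptions, not part of the paper, and the `levels` list is a made-up example.

```python
def first_join_level(levels):
    """For each object, the first level at which its cluster has more than one member."""
    n = sum(len(c) for c in levels[0])
    joined = {}
    for lvl, clusters in enumerate(levels):
        for c in clusters:
            if len(c) > 1:
                for i in c:
                    joined.setdefault(i, lvl)   # keep only the first occurrence
    return [joined[i] for i in range(n)]

# Hypothetical dendrogram levels for five objects; object 4 joins last.
levels = [
    [[0], [1], [2], [3], [4]],
    [[0, 1], [2, 3], [4]],
    [[0, 1, 2, 3], [4]],
    [[0, 1, 2, 3, 4]],
]
join = first_join_level(levels)
total = sum(join)
scores = [l / total for l in join]   # assumed normalization: proportional to join level
print(join, scores)
```

Object 4 receives the largest score because it stays isolated until the final level.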
We may parametrize this strategy by a ratio in $(0, 1]$ that specifies the fraction of the (smallest) reliable linkages to be established at each step. A value close to zero then corresponds to the standard variant, whereas a value of one recovers the reliable strategy described in Algorithm 1. In this way, we can provide an even richer family of alternative strategies for performing agglomerative clustering.
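A possible reading of this parameterization is sketched below. It assumes the reliable linkages of the current step are available as `(length, cluster_a, cluster_b)` tuples and that at least one linkage is always kept; the rounding rule and the names are assumptions made for illustration.

```python
import math

def select_reliable(linkages, ratio):
    """Keep the given fraction of the smallest reliable linkages (at least one)."""
    if not 0 < ratio <= 1:
        raise ValueError("ratio must lie in (0, 1]")
    k = max(1, math.ceil(ratio * len(linkages)))   # assumed rounding rule
    return sorted(linkages)[:k]

# Hypothetical reliable linkages found at one step: (length, cluster_a, cluster_b).
linkages = [(4.0, 2, 3), (1.0, 0, 1), (9.0, 4, 5)]
print(select_reliable(linkages, 0.01))  # close to zero: only the smallest linkage
print(select_reliable(linkages, 0.5))
print(select_reliable(linkages, 1.0))   # one: all reliable linkages
```

With a ratio close to zero only the minimal linkage survives, which matches the standard strategy, while a ratio of one keeps every reliable linkage.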
3 Reliable Minimum Spanning Trees
Minimum spanning trees (MSTs) are used in several applications such as transportation, computer and telecommunication networks [17], image segmentation [12], taxonomy learning [38] and power systems [29]. It is known that the single linkage method is equivalent to Kruskal’s algorithm [24] for computing a minimum spanning tree [16]. Consistently, we show that Algorithm 1 with the single criterion also yields a minimum spanning tree (Theorem 1), whose construction is adaptive w.r.t. the diverse densities of the underlying data.
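Before proving Theorem 1, we first introduce some notation. Consider a forest (collection) of trees $\{T_p\}$. The distance (the edge weight) between two trees $T_p$ and $T_q$ is computed according to the single criterion, i.e.,

$$\mathrm{dist}(T_p, T_q) = \min_{i \in T_p,\; j \in T_q} \mathbf{D}_{ij},$$

where $\mathbf{D}_{ij}$ is the pairwise dissimilarity between objects $i$ and $j$. The nearest tree from tree $T_p$, i.e., $\mathrm{nn}(T_p)$, is obtained via $\arg\min_{T_q \neq T_p} \mathrm{dist}(T_p, T_q)$. Moreover, $e^{*}_{T_p}$ denotes the edge that corresponds to the nearest tree from $T_p$, i.e.,

$$e^{*}_{T_p} = \arg\min_{e_{ij} \in E,\; i \in T_p,\; j \notin T_p} \mathbf{D}_{ij},$$

where $E$ is the set of all current inter-tree edges.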
Theorem 1.
The dendrogram generated by Algorithm 1 with the single linkage criterion computes a minimum spanning tree.
Proof sketch.
Consider a forest of trees $\{T_p\}$. According to the connectivity condition of the final minimum spanning tree, every tree $T_p$ should be connected via an edge to the rest of the MST. This edge should be $e^{*}_{T_p}$, i.e., an edge (linkage) with minimal weight among the edges that connect $T_p$ to the other trees, to keep the spanning tree minimal. Otherwise, if a larger edge is selected, the resultant spanning tree will have a larger total weight (i.e., a contradiction occurs). The linkage suggested by Algorithm 1 (with the single criterion) satisfies this condition: the selected linkage is the smallest linkage connected to both $T_p$ and $T_q$ (the tree at the other side).
Hence, at the beginning, we consider each object as a separate tree, where all must belong to the final minimum spanning tree. Then, according to the aforementioned argument and based on induction, the edges selected at each step belong to the final MST. Thus, the final tree will be a minimum spanning tree. ∎
In this context, the generalized greedy algorithm [15] provides a general framework for computing minimum spanning trees, by showing that the minimal edge $e^{*}_{T_p}$ leaving a tree $T_p$ is a choice consistent with a final minimum spanning tree. Thereby, a greedy MST algorithm, at each step, i) picks $T_p$ and $T_q$, i.e., two candidate trees where at least one is the nearest neighbor of the other, ii) combines them via the smallest edge to build a larger tree, and iii) removes the selected trees $T_p$ and $T_q$. The procedure continues until only a single tree with $n$ nodes remains, which is an MST.
Different algorithms, e.g., Kruskal’s and Prim’s [33], differ only in the way they pick the candidate trees at each step. Kruskal’s algorithm, at each step, picks a pair of trees that have a minimal distance among all pairs of trees. Prim’s algorithm, in contrast, produces the MST by growing only one tree, say $T$, iteratively attaching the singleton tree which has minimal distance to it, until it contains all the singleton trees.
Algorithm 1 with the single criterion yields an alternative viewpoint on the construction of MSTs. According to the generalized greedy algorithm, to combine two candidate trees, it is sufficient that one of them occurs in the nearest neighborhood of the other. However, Algorithm 1 requires that both trees mutually occur inside the nearest neighborhood of each other. As shown, e.g., in Figure 1, such a strategy yields a robust and adaptive minimum spanning tree. In summary,
I. The standard agglomerative method, in a very strict way, selects only one reliable linkage at each step, the one which has a minimal length (weight).
II. On the other hand, the generalized greedy algorithm for MST construction allows one to select any edge which occurs inside the nearest neighbors of one of the trees, regardless of being reliable or not (which might not be robust).
III. Algorithm 1 follows an intermediate strategy. It selects all the reliable linkages (edges), which yields adaptation and flexibility (compared to the former approach) and robustness (compared to the latter approach).
IV. Parameterization of Algorithm 1 (by the ratio of reliable linkages established at each step, as discussed before) can lead to an even larger family of different (reliable) minimum spanning tree algorithms.
We note that the final MST obtained by Algorithm 1 could be the same as Kruskal’s MST. However, the order of selecting the edges differs. Thus, in particular, if we stop constructing the MST early, the available partial solution could be different.
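The connection to minimum spanning trees can be checked numerically on a small example. The sketch below is an illustration under stated assumptions, not the paper's implementation: it works on a dissimilarity matrix `D`, breaks ties by comparing `(weight, i, j)` tuples (a simplification of keeping all tied neighbors), and compares the total weight of the edges selected through mutual nearest trees with Kruskal's result. All names are hypothetical.

```python
def find(parent, x):
    """Union-find root lookup with path halving."""
    while parent[x] != x:
        parent[x] = parent[parent[x]]
        x = parent[x]
    return x

def kruskal_weight(D):
    """Total weight of the MST computed by Kruskal's algorithm."""
    n = len(D)
    parent = list(range(n))
    total = 0
    for w, i, j in sorted((D[i][j], i, j) for i in range(n) for j in range(i + 1, n)):
        ri, rj = find(parent, i), find(parent, j)
        if ri != rj:
            parent[ri] = rj
            total += w
    return total

def reliable_mst(D):
    """Edges selected when only mutually nearest trees are linked (single criterion)."""
    n = len(D)
    comp = list(range(n))            # tree label of every object
    edges = []
    while len(set(comp)) > 1:
        best = {}                    # lightest outgoing edge of every tree
        for i in range(n):
            for j in range(n):
                if comp[i] != comp[j]:
                    cand = (D[i][j], min(i, j), max(i, j))
                    if comp[i] not in best or cand < best[comp[i]]:
                        best[comp[i]] = cand
        # Reliable: the same edge is the lightest one for the trees at both ends.
        chosen = {e for e in best.values()
                  if best[comp[e[1]]] == e and best[comp[e[2]]] == e}
        for w, i, j in sorted(chosen):
            edges.append((w, i, j))
            old, new = comp[j], comp[i]
            comp = [new if c == old else c for c in comp]
    return edges

points = [0, 1, 10, 12, 30]
D = [[(a - b) ** 2 for b in points] for a in points]
edges = reliable_mst(D)
print(edges, sum(w for w, _, _ in edges), kruskal_weight(D))
```

On this data both procedures reach the same total weight, while the reliable variant establishes two edges already in its first round.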
4 Experiments
We experimentally evaluate the performance of the reliable agglomerative strategy on a variety of real-world datasets and compare it against the standard approach. In these datasets, each object (i.e., document, image, etc.) is represented by a vector according to the respective features. For the text documents, we use the tf-idf vectors. We compute the pairwise dissimilarities between the objects according to the squared Euclidean distance.
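The pairwise dissimilarity computation mentioned above can be sketched as follows, assuming the feature vectors are given as a list of equal-length lists; the function name and the toy vectors are illustrative only.

```python
def pairwise_sq_euclidean(Y):
    """Symmetric matrix of squared Euclidean distances between the rows of Y."""
    n = len(Y)
    D = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            d = sum((a - b) ** 2 for a, b in zip(Y[i], Y[j]))
            D[i][j] = D[j][i] = d
    return D

# Three illustrative 2-D feature vectors.
Y = [[0.0, 0.0], [3.0, 4.0], [6.0, 8.0]]
D = pairwise_sq_euclidean(Y)
print(D)
```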
Data
The first datasets are selected from the UCI data repository [11].
1. Ecoli: contains the information of protein localization sites in categories.
2. Hayes Roth: is related to a study on human subjects which contains instances and classes.
3. Iris: contains the information of iris plants grouped in classes.
4. Lung Cancer: includes types of instances of pathological lung cancer.
5. Perfume: consists of odors of different perfumes (classes), where there are in total measurements.
6. Seeds: includes measurements of geometrical properties of kernels belonging to different varieties of wheat.
7. Wine: contains measurements of a chemical analysis of different types of wines.
We also use the three main subsets of the 20-newsgroup data collection:
1. COMP: a subset of documents in five groups: ‘comp.graphics’, ‘comp.windows.x’, ‘comp.os.ms-windows.misc’, ‘comp.sys.ibm.pc.hardware’, ‘comp.sys.mac.hardware’.
2. REC: a subset of documents in four groups related to race and sports: ‘rec.autos’, ‘rec.motorcycles’, ‘rec.sport.baseball’, ‘rec.sport.hockey’.
3. SCI: a subset of documents in four groups related to science: ‘sci.crypt’, ‘sci.electronics’, ‘sci.med’, ‘sci.space’.
In addition, we investigate the performance of the different strategies on real datasets collected by a document processing corporation. The original dataset (called Real I) contains the vectors of scanned documents each represented in a dimensional space. This dataset contains clusters, several of which have only one or a few documents. By removing the clusters with only one or two documents, we obtain a new dataset, called Real II ( documents). Finally, we obtain Real III by keeping the clusters that have at least documents ( documents).
Evaluation
To investigate the quality of a dendrogram, cophenetic correlation [40] is sometimes employed, especially in biostatistics. It measures the correlation between the dendrogram and the base dissimilarities between the objects. However, this evaluation measure has several issues: i) it considers only the direct distances and discards manifolds and elongated structures, and ii) its value is very sensitive to the way the inter-cluster distances are computed. For example, the single and Ward criteria might lead to the same dendrogram, but their cophenetic correlations could differ significantly, since they compute different types of distances between the clusters (which constitute the elements of a dendrogram).
In our experiments, however, we have access to the ground truth, i.e., the true labels of the objects. Thus, we may stop early once the desired number of clusters is reached, or eliminate the last linkages from a dendrogram to produce that number of clusters. There exist more involved methods to convert a dendrogram into a set of clusters, but they require fixing critical parameters in advance, and finding their correct values is non-trivial in an unsupervised setting such as clustering.
With both strategies, ties might occur when producing exactly the desired number of clusters. We handle them in the same way as common implementations do, e.g., we break the ties according to the order (index) of the clusters; any other tie-breaking trick is applicable to both approaches as well. Moreover, we observe that such ties usually occur at the lower levels of the dendrogram, i.e., when the number of clusters is very large. For a rather small number of clusters, which is the case in many clustering problems, such ties are very rare: in real data, it does not often happen that many real clusters are mutually nearest neighbors of each other. Multiple reliable linkages typically arise at the low or intermediate levels. Thus, at the higher levels, where we remove the linkages, ties are not common.
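Removing the last linkages to obtain a fixed number of clusters can be sketched as follows. The sketch assumes the dendrogram is given as a flat, ordered list of merges in which objects are numbered from 0 and each merge creates a new cluster with the next free identifier; linkages established in the same step are assumed to be already ordered by index, which corresponds to the tie-breaking described above. The function name `cut_dendrogram` and this input format are illustrative assumptions.

```python
def cut_dendrogram(n_objects, merges, n_clusters):
    """Label objects after dropping the last (n_clusters - 1) merges."""
    members = {i: [i] for i in range(n_objects)}
    next_id = n_objects
    keep = len(merges) - (n_clusters - 1)  # merges that are actually applied
    for a, b in merges[:keep]:
        # The merged cluster receives a fresh identifier.
        members[next_id] = members.pop(a) + members.pop(b)
        next_id += 1
    labels = [None] * n_objects
    for label, objects in enumerate(members.values()):
        for obj in objects:
            labels[obj] = label
    return labels


# Hypothetical dendrogram over five objects:
# 5 = {0, 1}, 6 = {3, 4}, 7 = {0, 1, 2}, 8 = all objects.
merges = [(0, 1), (3, 4), (5, 2), (7, 6)]
print(cut_dendrogram(5, merges, 2))
print(cut_dendrogram(5, merges, 3))
```

The cluster labels themselves are arbitrary; only the induced grouping matters for the evaluation measures.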
We compare the true and the computed clusters according to three criteria:
1. Normalized Mutual Information [41], which measures the mutual information between the true and the estimated solutions.
2. Normalized Rand score [21], which computes the similarity between the two solutions.
3. V-measure [34], which obtains the harmonic mean of homogeneity and completeness.
We compute the normalized variant of these measures, such that they yield zero for randomly estimated solutions and thereby any positive score indicates a (partially) consistent solution.
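To illustrate what a chance-normalized score looks like, the sketch below computes the adjusted Rand index from pair counts. It assumes that the Normalized Rand score used here corresponds to this standard chance-corrected form; the exact variant used in the experiments is not spelled out in the text, and the function name `adjusted_rand` is illustrative.

```python
from collections import Counter
from math import comb


def adjusted_rand(true_labels, pred_labels):
    """Chance-corrected Rand index: 1 for identical partitions, about 0 for random ones."""
    n = len(true_labels)
    total_pairs = comb(n, 2)
    if total_pairs == 0:
        return 1.0  # fewer than two objects: the partitions trivially agree
    cells = Counter(zip(true_labels, pred_labels))  # contingency table entries
    sum_cells = sum(comb(c, 2) for c in cells.values())
    sum_true = sum(comb(c, 2) for c in Counter(true_labels).values())
    sum_pred = sum(comb(c, 2) for c in Counter(pred_labels).values())
    expected = sum_true * sum_pred / total_pairs  # agreement expected by chance
    maximum = (sum_true + sum_pred) / 2
    if maximum == expected:
        return 1.0  # degenerate case, e.g. both partitions are a single cluster
    return (sum_cells - expected) / (maximum - expected)


print(adjusted_rand([0, 0, 1, 1], [1, 1, 0, 0]))  # same grouping, different label names
print(adjusted_rand([0, 0, 1, 1], [0, 1, 0, 1]))  # grouping that disagrees on every pair
```

The score depends only on the grouping, not on the label names, and it can become negative when the agreement is below what is expected by chance.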
Results
Tables 1, 2 and 3 show the performance scores w.r.t. Normalized Mutual Information, Normalized Rand score and V-measure, respectively, where the best results for each dataset are bolded. We observe that on the different datasets, the reliable agglomeration strategy always contributes to the best results. In most cases, it significantly improves on the best results of the standard strategy, and in fewer cases it yields results consistent with it. Moreover, in a few cases the reliable variant yields (slightly) worse results for a non-optimal criterion. However, such a criterion is not the best choice, and the respective scores are not high compared to the alternatives. For example, on Real I and Real II with the centroid criterion, the standard strategy yields slightly better scores than the reliable strategy. However, the centroid criterion is not the best option and yields very low scores in any case. With a more appropriate criterion (e.g., average or Ward), the reliable strategy gives significantly higher scores.
Note that the different evaluation measures are often consistent, but in some cases they might disagree. For example, on the Seeds dataset, Normalized Mutual Information suggests the Ward criterion as the best option, but Normalized Rand score selects the average criterion, although Ward still yields high scores. Finally, we observe similar experimental runtimes for the two strategies, as they perform similar operations. For example, on the COMP dataset and with the single criterion, the runtimes of the standard and reliable strategies are and seconds. With the average criterion, the runtimes respectively are and seconds.
5 Conclusion
We investigated an adaptive and density-consistent strategy for agglomerative clustering, wherein at each step we establish all the reliable linkages, instead of establishing only the smallest one (consistent with the high-level method in [3]). The two clusters connected by a reliable linkage share similar properties, such that they select each other as a nearest neighbor. This strategy enables the dendrogram to be adaptive w.r.t. the diverse densities of different clusters and supports early stopping of the clustering procedure. We then studied how reliable agglomerative clustering with the single criterion can be used to produce a minimum spanning tree. Finally, we performed experiments on several real-world datasets to investigate the performance of the reliable agglomerative strategy.
Acknowledgement. This work was partially done at Xerox Research Centre Europe (XRCE).
References
- [1] Margareta Ackerman and Shai Ben-David. A characterization of linkage-based hierarchical clustering. Journal of Machine Learning Research, 17:1–17, 2016.
- [2] Maria-Florina Balcan, Yingyu Liang, and Pramod Gupta. Robust hierarchical clustering. J. Mach. Learn. Res., 15(1):3831–3871, 2014.
- [3] Michel Bruynooghe. Méthodes nouvelles en classification automatique de données taxinomiques nombreuses. Statistique et analyse des données, 2(3):24–42, 1977.
- [4] Moses Charikar and Vaggos Chatziafratis. Approximate hierarchical clustering via sparsest cut and spreading metrics. In Proceedings of the Twenty-Eighth Annual ACM-SIAM Symposium on Discrete Algorithms, SODA ’17, pages 841–854, 2017.
- [5] Morteza Haghir Chehreghani, Hassan Abolhassani, and Mostafa Haghir Chehreghani. Improving density-based methods for hierarchical clustering of web pages. Data Knowl. Eng., 67(1):30–50, 2008.
- [6] Morteza Haghir Chehreghani, Mostafa Haghir Chehreghani, and Hassan Abolhassani. Probabilistic heuristics for hierarchical web data clustering. Computational Intelligence, 28(2):209–233, 2012.
- [7] K. Chidananda Gowda and G. Krishna. Agglomerative clustering using the concept of mutual nearest neighbourhood. Pattern Recognition, 10(2):105–112, 1978.
- [8] Michael Cochez and Hao Mou. Twister tries: Approximate hierarchical agglomerative clustering for average distance in linear time. In SIGMOD ’15, pages 505–517, 2015.
