Visualization tools for parameter selection in cluster analysis
Alexander Rolle, Luis Scoccola

TL;DR
This paper introduces HPREF, an algorithm that visualizes the structure of multiple clusterings of a dataset, aiding parameter selection in cluster analysis.
Contribution
The paper presents HPREF, a novel hierarchical partitioning algorithm that visualizes the space of clusterings generated by varying parameters.
Findings
Provides a geometric visualization of clustering results
Helps identify optimal parameters for clustering algorithms
Enhances understanding of clustering stability and variability
Abstract
We propose an algorithm, HPREF (Hierarchical Partitioning by Repeated Features), that produces a hierarchical partition of a set of clusterings of a fixed dataset, such as sets of clusterings produced by running a clustering algorithm with a range of parameters. This gives geometric structure to such sets of clusterings, and can be used to visualize the set of results one obtains by running a clustering algorithm with a range of parameters.
Table 1
| # clusterings in class | adj. Rand mean | adj. Rand min. | adj. Rand max. | adj. Rand std. deviation |
|---|---|---|---|---|

Table 2
| clusterings | number of clusterings | mean | min. | max. | std. deviation |
|---|---|---|---|---|---|
| all | 40 | | | | |
| k-means | 20 | | | | |
| k-means++ | 20 | | | | |

Table 3
| # clusterings in class | mean | min. | max. | std. deviation |
|---|---|---|---|---|
| 4 | | | | |
| 13 | | | | |
| 1 | – | | | |
| 1 | – | | | |
| 1 | – | | | |
| 20 | | | | |
1 Introduction
Often, a clustering algorithm, rather than producing a single clustering of a dataset, produces a set of clusterings. For example, one gets a set of clusterings by running a clustering algorithm with a range of parameters. The starting point of this paper is the observation that sets of clusterings ought to have geometric structure. Indeed, various metrics have been proposed for the set of all clusterings of a fixed dataset [10, 4, 11].
In this paper, we define a metric on any set of clusterings of a fixed dataset that is particularly convenient for visualization. The metric is induced by a hierarchical partition of the set of clusterings, which is defined as follows. Any pair of data points can be used to partition the set of clusterings into two classes: a class containing those clusterings that cluster the two points together, and a class containing those that do not. Say that two pairs of data points are equivalent if they define the same partition of the set of clusterings. A large equivalence class defines a partition that is witnessed by many pairs of data points. We produce a hierarchical partition of the set of clusterings by successively partitioning according to the largest equivalence classes. Using pairs of data points to discriminate between different clusterings has a long history, particularly in the many variations on the so-called Rand index [12, 6, 14, 7]. The voting-style method of this paper is a practical, scalable way to adapt these ideas for the detection and visualization of the large-scale features of a set of clusterings.
In Section 2 we describe this procedure in detail, and in Section 3 demonstrate how our algorithm can be used as a visualization tool. An implementation of HPREF is available at [13].
2 Hierarchical Partitioning by Repeated Features
By a clustering of a set, we mean a set of disjoint subsets of that set. Points that do not belong to any subset in a clustering are called noise points of that clustering.
We begin by recalling a well-known way to encode clusterings as binary vectors. Given a set $S$ of clusterings of a set $X$, there is an embedding
$$M \colon S \to \{0,1\}^{P},$$
where $P$ is the set of unordered pairs of points of $X$ (with repetition), and $\{0,1\}^{P}$ is the set of binary vectors indexed by $P$. For $C \in S$ and $\{x, y\} \in P$, let $M(C)_{\{x,y\}} = 1$ if $C$ clusters $x$ and $y$ together, and $M(C)_{\{x,y\}} = 0$ otherwise. We allow pairs with repetition to distinguish noise points from one-point clusters: if $x$ is a noise point of $C$, then $M(C)_{\{x,x\}} = 0$, but if $\{x\}$ is a cluster of $C$, then $M(C)_{\{x,x\}} = 1$. Write $n = |S|$ and $m = |P|$. We'll think of $M$ as an $n \times m$ matrix, so that a row of $M$ is $M(C)$ for some $C \in S$. Each column of $M$ corresponds to a pair of points of $X$.
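The encoding described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: it assumes each clustering is given as a list of integer labels, one per data point, with `-1` marking noise points (the convention scikit-learn uses). The names `all_pairs`, `encode`, and `pair_matrix` are hypothetical.

```python
from itertools import combinations_with_replacement

NOISE = -1  # ASSUMPTION: label used for noise points (scikit-learn's convention)

def all_pairs(n_points):
    """All unordered pairs of point indices, including repeated pairs (i, i)."""
    return list(combinations_with_replacement(range(n_points), 2))

def encode(labels, pairs):
    """Binary vector of one clustering: 1 where the pair is clustered together."""
    return tuple(
        int(labels[i] != NOISE and labels[i] == labels[j]) for i, j in pairs
    )

def pair_matrix(clusterings, pairs):
    """One row per clustering, one column per pair of points."""
    return [encode(labels, pairs) for labels in clusterings]

# Three points; the second clustering treats point 2 as noise.
clusterings = [[0, 0, 1], [0, 0, NOISE]]
pairs = all_pairs(3)
matrix = pair_matrix(clusterings, pairs)
```

In this toy example the two rows differ only in the column of the repeated pair `(2, 2)`, which is exactly how the encoding tells a one-point cluster apart from a noise point.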
It is not usually practical to consider all pairs of data points when constructing this binary matrix. Instead, one can first sample pairs from the dataset, then construct the matrix with columns corresponding only to the sampled pairs. Because HPREF uses columns of the matrix that occur often, the method is robust with respect to this sampling; see Section 3.2 for experimental results.
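One simple way to sample pairs is sketched below. The text does not specify a sampling scheme, so this is an assumption: the two points of each pair are drawn independently and uniformly from the dataset, and the same pair may be drawn more than once. The function name `sample_pairs` is hypothetical.

```python
import random

def sample_pairs(n_points, n_pairs, seed=None):
    """Draw pairs of point indices at random, returned as sorted tuples.

    ASSUMPTION: both points of a pair are drawn independently and uniformly,
    so a pair may repeat a point (i, i) and a pair may be drawn twice.
    """
    rng = random.Random(seed)
    pairs = []
    for _ in range(n_pairs):
        i, j = rng.randrange(n_points), rng.randrange(n_points)
        pairs.append((min(i, j), max(i, j)))
    return pairs
```

The sampled pairs then take the place of the full set of pairs as the columns of the binary matrix.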
Hierarchical partitioning.
A hierarchical partition of a set is a tree in which each node is associated to a subset of the set, such that the root is associated to the whole set, and the set associated to any node is a subset of the set associated to its parent.
A non-constant binary vector is a vector that contains both zeroes and ones. Each non-constant column of a binary matrix partitions the set of rows of the matrix into two classes: the class of rows with a zero in that column, and the class of rows with a one in that column. We call these the zero class and the one class of the column. A scoring function assigns a numeric score to any set of clusterings of a fixed dataset. We discuss the choice of scoring function below.
As input, the algorithm takes a set of clusterings of a fixed dataset and a target number of leaves. The output is a hierarchical partition of the set of clusterings.
1. Initialize a binary tree with just one node, and associate the input set of clusterings to it;
2. While the number of leaves of the tree is less than the target number:
   (a) For each leaf of the tree, define the score of the leaf to be the value of the scoring function on its associated set;
   (b) Select the leaf with the highest score, and consider its associated set;
   (c) Take the most repeated non-constant column of the binary matrix of this set, and partition the set into the class of clusterings with a zero in that column and the class of clusterings with a one in that column;
   (d) Add two children to the selected leaf, one associated to each of the two classes.
3. Return the tree.
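The steps above can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation referenced in the paper: the binary matrix is a list of equal-length tuples, and the scoring function is passed in as a parameter. The default `placeholder_score` is a stand-in chosen for this sketch only (it is not the scoring function HPREF uses), and the names `most_repeated_column` and `hpref` are hypothetical.

```python
from collections import Counter

def most_repeated_column(rows):
    """Return (column, multiplicity) for the most repeated non-constant column.

    `rows` is a list of equal-length binary tuples. Returns (None, 0) when
    every column is constant, e.g. when there is a single row.
    """
    counts = Counter(col for col in zip(*rows) if len(set(col)) > 1)
    if not counts:
        return None, 0
    return counts.most_common(1)[0]

def placeholder_score(rows):
    """ASSUMPTION: stand-in score, the multiplicity of the most repeated column."""
    return most_repeated_column(rows)[1]

def hpref(matrix, n_leaves, score=placeholder_score):
    """Return the leaves (lists of row indices) and the splits, in order."""
    leaves = [list(range(len(matrix)))]
    splits = []
    while len(leaves) < n_leaves:
        # Score every leaf and select the one with the highest score.
        scores = [score([matrix[i] for i in leaf]) for leaf in leaves]
        best = max(range(len(leaves)), key=scores.__getitem__)
        rows = [matrix[i] for i in leaves[best]]
        column, _ = most_repeated_column(rows)
        if column is None:
            break  # the highest-scoring leaf cannot be split; stop early
        parent = leaves.pop(best)
        zeros = [i for i, bit in zip(parent, column) if bit == 0]
        ones = [i for i, bit in zip(parent, column) if bit == 1]
        splits.append((parent, zeros, ones))
        leaves.extend([zeros, ones])
    return leaves, splits

# Four clusterings encoded over four pairs of points.
matrix = [(1, 1, 0, 0), (1, 1, 0, 1), (0, 0, 1, 0), (0, 0, 1, 1)]
leaves, splits = hpref(matrix, n_leaves=2)
```

Here the column `(1, 1, 0, 0)` occurs twice among the columns of `matrix`, so the first split separates rows 2 and 3 from rows 0 and 1. The sketch records the tree as the ordered list of splits rather than as an explicit tree structure.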
Dendrograms.
Hierarchical partitions are especially useful when they can be represented as dendrograms. By dendrogram we mean a hierarchical partition of a set, where each node has a weight such that the weight of any node is smaller than the weight of its parent. The weights allow us to visualize the dendrogram in two dimensions.
The output of HPREF can be represented by a dendrogram: let the weight of a node be given by the score of the node plus the sum of the scores of all the nodes that were added to the tree after it.
Moreover, by a well-known construction (see, e.g., [2]), this dendrogram defines a metric on the set of clusterings.
Scoring functions.
The goal of the scoring function is to quantify how much a set of clusterings deserves to be partitioned.
Say we are given a set of clusterings, and form its binary matrix. Consider the multiplicity of the most repeated non-constant column of the matrix, and the number of non-constant columns of the matrix. A large multiplicity indicates a partition of the set that is witnessed by many pairs of data points, and a large number of non-constant columns indicates heterogeneity in the set. HPREF uses the scoring function
[TABLE]
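The two quantities that enter the scoring function can be computed directly from the binary matrix. The sketch below stops at these two ingredients and does not attempt to combine them into a score; the function name `column_statistics` is hypothetical, and rows are assumed to be equal-length binary tuples.

```python
from collections import Counter

def column_statistics(rows):
    """Return (multiplicity of the most repeated non-constant column,
    number of non-constant columns) for a list of binary rows."""
    non_constant = [col for col in zip(*rows) if len(set(col)) > 1]
    if not non_constant:
        return 0, 0  # nothing distinguishes the rows from each other
    multiplicity = max(Counter(non_constant).values())
    return multiplicity, len(non_constant)

rows = [(1, 1, 0, 0), (1, 1, 0, 1), (0, 0, 1, 0), (0, 0, 1, 1)]
stats = column_statistics(rows)
```

For these four rows every column is non-constant and the column `(1, 1, 0, 0)` occurs twice, so `stats` is `(2, 4)`.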
Complexity.
Let be the number of pairs of points of that we choose to sample, and let . Assume that . Using dictionaries implemented as tries, the time complexity of HPREF is in . The same analysis shows that this is also the space complexity.
3 Examples
In this section we present two examples. In the first, we generate a set of clusterings of Fisher’s Iris dataset by running DBSCAN with a range of parameters, and show how one can visualize this set of clusterings using HPREF. Our algorithm allows one to easily identify the parameters for which DBSCAN separates the three species of Iris in the dataset.
In the second example, we generate a set of clusterings of a large dataset used for The Third International Knowledge Discovery and Data Mining Tools Competition, by running k-means with different initializations. This example shows that, even using a very small sample of pairs of data points, our algorithm produces meaningful results.
For the examples, we use the scikit-learn implementations of DBSCAN and k-means.
3.1 Clustering the Iris dataset with DBSCAN
In a survey paper on density-based clustering by Kriegel, Kröger, Sander, and Zimek, the authors write that density-based clustering algorithms are “particularly suitable” for certain applications coming from biology [8, p232]; an example they give is Fisher’s Iris dataset [5], which illustrates the “typical properties of natural (biological) clusters” [8, p233]. The Iris dataset records the petal and sepal width and length of 150 Iris flowers. There are 50 observations of each of the species Setosa, Versicolor, and Virginica; the observations of Setosa are linearly separable from the observations of Versicolor and Virginica, but the latter two are not linearly separable from each other.
The clustering algorithm DBSCAN may be the best known density-based clustering algorithm, and is a main topic of [8]. It takes two parameters: a distance scale and a density threshold. We use HPREF to study the output of DBSCAN on the Iris dataset as these parameters vary.
Let be the set of clusterings of the Iris dataset obtained by running DBSCAN with .
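A set of clusterings of this kind can be generated with scikit-learn along the following lines. The parameter grid below is an illustrative assumption and is not the range used in the experiment; in scikit-learn the distance scale is the `eps` argument of `DBSCAN` and the density threshold is `min_samples`.

```python
from itertools import product

from sklearn.cluster import DBSCAN
from sklearn.datasets import load_iris

X = load_iris().data  # 150 observations, 4 features

# ASSUMPTION: illustrative parameter grid, not the one used in the paper.
eps_values = [0.3, 0.5, 0.7, 0.9]
min_samples_values = [3, 5, 10]

clusterings = {}
for eps, min_samples in product(eps_values, min_samples_values):
    labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(X)
    clusterings[(eps, min_samples)] = labels  # -1 marks noise points
```

Each value of `clusterings` is an array of 150 labels, which can be encoded as one row of the binary matrix.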
HPREF.
The hierarchical clustering of produced by HPREF with is shown in Fig. 1. The most repeated column of appears times. Recall that, at each node of the hierarchy, we are considering a matrix obtained from by selecting a class of rows; the most repeated column of these matrices, in the order they appear in the hierarchy, appears , , , , , times.
The red class of Fig. 2, which consists of the clusterings obtained with , contains the best solutions to the clustering problem: the clusterings in this class do a reasonable job of separating the three species of Iris. The brown class of Fig. 2, consisting of the clusterings obtained with
[TABLE]
also contains interesting results, but these clusterings are further from the “correct” clustering of the dataset, as they separate the observations of Iris Virginica into multiple clusters. See Fig. 3 for representatives of the different classes.
Since the Iris dataset is labeled, we can compare the clusterings produced by DBSCAN with the labels, using one of the standard distances between clusterings. In Table 1 we compute the average, maximum, minimum, and standard deviation of the adjusted Rand index [7] between the clusterings of each of the classes of the second partition of Fig. 2 and the labels. We regard noise points as one-point clusters when computing the adjusted Rand index. We see that, according to the adjusted Rand index, the red class coincides with the best four clusterings in the set.
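For reference, the adjusted Rand index of [7] and the treatment of noise points as one-point clusters can be sketched in plain Python. This is a minimal illustration, assuming clusterings are given as label lists with `-1` for noise; the function names are hypothetical, and library implementations such as the one in scikit-learn would normally be preferred.

```python
from collections import Counter
from math import comb

def noise_as_singletons(labels, noise=-1):
    """Give every noise point its own cluster label."""
    return [("noise", i) if lab == noise else lab for i, lab in enumerate(labels)]

def adjusted_rand_index(labels_a, labels_b):
    """Adjusted Rand index of Hubert and Arabie between two labelings."""
    n = len(labels_a)
    if n < 2:
        return 1.0  # no pairs of points to compare
    pairs_both = sum(comb(c, 2) for c in Counter(zip(labels_a, labels_b)).values())
    pairs_a = sum(comb(c, 2) for c in Counter(labels_a).values())
    pairs_b = sum(comb(c, 2) for c in Counter(labels_b).values())
    expected = pairs_a * pairs_b / comb(n, 2)
    maximum = (pairs_a + pairs_b) / 2
    if maximum == expected:
        return 1.0  # degenerate case: both labelings are trivial and equal
    return (pairs_both - expected) / (maximum - expected)
```

For example, `adjusted_rand_index(noise_as_singletons(labels), species)` compares one DBSCAN output `labels` with the species labels `species`, where both names stand for label lists of the same length.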
Alternative visualization of the space of clusterings.
We apply PCA to the rows of the binary matrix of the set of clusterings. We keep the first components, which account for approximately of the variance. We plot the first components of the rows in Fig. 4, colored according to the last partition of Fig. 2. In this example, HPREF captures much of the geometric structure that we see in the visualization produced by PCA: the partitions produced by HPREF correspond well to the clustering structure we see in the visualization, and the order in which these distinctions appear in the hierarchy reflects the extent to which these distinctions are obvious in the visualization.
3.2 Choosing initial centers for k-means
Given a finite set of points in Euclidean space, the k-means problem is to choose k centers that minimize the sum of the squared distances between each point and its closest center. A commonly used algorithm to find approximate solutions to the k-means problem is due to Lloyd [9]. The algorithm begins by choosing k centers at random from the dataset. It then assigns each data point to its closest center, and recomputes each center as the center of mass of the points assigned to it. This step is repeated until the process stabilizes. This produces a clustering with k clusters, in which a point belongs to the cluster of its closest center.
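Lloyd's algorithm as described above can be sketched in plain Python. This is a minimal illustration, not the scikit-learn implementation used in the experiments: points are tuples of floats, the function names are hypothetical, and keeping the old center when a cluster becomes empty is an assumption made for this sketch.

```python
import random

def squared_distance(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

def closest(point, centers):
    """Index of the center closest to `point`."""
    return min(range(len(centers)), key=lambda c: squared_distance(point, centers[c]))

def cost(points, centers):
    """Sum of squared distances from each point to its closest center."""
    return sum(squared_distance(p, centers[closest(p, centers)]) for p in points)

def lloyd(points, centers, max_iter=100):
    """Run Lloyd's algorithm from the given initial centers."""
    centers = [tuple(c) for c in centers]
    for _ in range(max_iter):
        labels = [closest(p, centers) for p in points]
        new_centers = []
        for c, old in enumerate(centers):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if not members:
                new_centers.append(old)  # ASSUMPTION: keep an empty cluster's center
                continue
            new_centers.append(tuple(sum(xs) / len(members) for xs in zip(*members)))
        if new_centers == centers:
            break  # the process has stabilized
        centers = new_centers
    return centers, [closest(p, centers) for p in points]

def random_init(points, k, seed=None):
    """Choose k initial centers uniformly at random from the dataset."""
    return random.Random(seed).sample(points, k)

points = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
centers, labels = lloyd(points, [(0.0, 0.0), (10.0, 0.0)])
```

On this toy dataset the centers move to the midpoints of the two groups after one update, and the second pass confirms that nothing changes.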
Of course, the outcome of Lloyd’s algorithm depends on the choice of the initial centers. A common approach is to choose these initial centers uniformly at random from the dataset. In [1], Arthur and Vassilvitskii propose a more sophisticated approach: choosing the initial centers at random from the dataset, but weighting data points according to their squared distance from the closest center already chosen.
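The weighted seeding of [1] can be sketched as follows. This is a minimal illustration of the basic scheme, assuming points are tuples of floats; it is not the scikit-learn implementation, and the function name `kmeans_pp_init` is hypothetical.

```python
import random

def squared_distance(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeans_pp_init(points, k, seed=None):
    """Choose k initial centers, weighting points by squared distance."""
    rng = random.Random(seed)
    centers = [rng.choice(points)]  # first center: uniform at random
    while len(centers) < k:
        # Weight each point by its squared distance to the closest chosen center.
        weights = [min(squared_distance(p, c) for c in centers) for p in points]
        if sum(weights) == 0:
            # Every point coincides with a chosen center; fall back to uniform.
            centers.append(rng.choice(points))
        else:
            centers.append(rng.choices(points, weights=weights, k=1)[0])
    return centers
```

Points that coincide with an already chosen center have weight zero, so the seeding spreads the initial centers across the dataset.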
Following [1], we’ll refer to Lloyd’s algorithm, with initial centers chosen uniformly at random from the dataset, as k-means, and we’ll refer to Lloyd’s algorithm, with initial centers chosen according to the method of [1], as k-means++.
In [1], Arthur and Vassilvitskii compare the performance of k-means and k-means++ on four datasets, including the KDD Cup 1999 dataset from the University of California–Irvine Machine Learning Repository [3].
The KDD Cup 1999 dataset simulates features available to an intrusion detection system, and was the dataset used for The Third International Knowledge Discovery and Data Mining Tools Competition. We used the full dataset available at the UCI Machine Learning Repository, which consists of points. We kept the continuous features, ignoring the categorical features. Following [1], we consider a set of 40 clusterings of the KDD Cup 1999 dataset, with 20 produced by k-means and 20 produced by k-means++, both with . Information about the associated values of is in Table 2.
HPREF.
We run HPREF on with and a sample of pairs of data points. The resulting hierarchy is displayed in Fig. 5. The most repeated column of appears times, and exactly separates the output of k-means and k-means++. I.e., the partition of corresponding to the red cut of Fig. 5 has one class containing the clusterings produced by k-means, and another class containing the clusterings produced by k-means++.
To get a finer partition of the set of clusterings, we consider the partition corresponding to the green cut of Fig. 5. This is the finest partition produced by HPREF that does not divide the output of k-means++ into multiple classes. The clusterings produced by k-means are partitioned into five classes, three of which are singletons. The results are displayed in Table 3.
We see that the partition corresponds well to the values of the clusterings. In particular, while the standard deviation of the values of the k-means runs is on the order of , the standard deviation of the values in each class is on the order of or less.
Sampling and performance.
To test the reliability of this result, we run HPREF on samples of pairs of data points, each time with . In every case, the partition obtained by taking the leaves of the hierarchy is exactly the result displayed in Table 3. We also run HPREF on samples of pairs of data points, and obtain the result of Table 3 times.
On a laptop with a 2.20 GHz Intel Core i7 and 16GB of RAM, using the scikit-learn implementation of k-means and k-means++, it took hours and minutes to generate the set of clusterings of the KDD Cup 1999 dataset. Running HPREF on samples of pairs of data points took minutes. Running HPREF on samples of pairs of data points took minutes.
Acknowledgments
We would like to thank Dan Christensen, Camila de Souza, and Rick Jardine for their helpful comments and suggestions.
- [1] David Arthur and Sergei Vassilvitskii. k-means++: the advantages of careful seeding. In Proceedings of the Eighteenth Annual ACM-SIAM Symposium on Discrete Algorithms, pages 1027–1035. ACM, New York, 2007.
- [2] Gunnar Carlsson and Facundo Mémoli. Characterization, stability and convergence of hierarchical clustering methods. J. Mach. Learn. Res., 11:1425–1470, 2010.
- [3] Dataset. KDD Cup, UCI Machine Learning Repository, 1999. http://archive.ics.uci.edu/ml/datasets/kdd+cup+1999+data.
- [4] Stijn Van Dongen. Performance criteria for graph clustering and Markov cluster experiments. Technical Report INS-R 0012, Centrum voor Wiskunde en Informatica, 2000.
- [5] R.A. Fisher. Iris dataset, UCI Machine Learning Repository, 1936. https://archive.ics.uci.edu/ml/datasets/Iris.
- [6] E. B. Fowlkes and C. L. Mallows. A method for comparing two hierarchical clusterings. Journal of the American Statistical Association, 78(383):553–569, 1983.
- [7] Lawrence Hubert and Phipps Arabie. Comparing partitions. Journal of Classification, 2(1):193–218, Dec 1985.
- [8] Hans-Peter Kriegel, Peer Kröger, Jörg Sander, and Arthur Zimek. Density-based clustering. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery, 1(3):231–240, 2011.
