Benchmarking Minimax Linkage
Xiao Hui Tai, Kayla Frisoli

TL;DR
This paper provides a comprehensive benchmarking of minimax linkage in hierarchical clustering, evaluating its performance across multiple datasets and metrics, and making the code publicly available for reproducibility.
Contribution
It offers a thorough, neutral benchmark study of minimax linkage, expanding on prior analyses with multiple performance metrics and datasets, and emphasizes reproducibility.
Findings
Minimax linkage often yields the smallest maximum minimax radius.
It produces tightly clustered objects around prototypes.
Performance varies when the number of clusters equals the true value.
Table 1. Comparisons of minimax linkage to standard linkage methods in (Bien and Tibshirani, 2011).
| Data set | Distance to prototype | Misclassification rate |
|---|---|---|
| Olivetti Faces | Yes | No |
| Grolier Encyclopedia | Yes | No |
| Colon Cancer | Not quite | No |
| Prostate Cancer | Not quite | No |
| Simulations | No | Yes |
Table 2. Summary of the data sets.
| Data set | Included in (Bien and Tibshirani, 2011)? | Description |
|---|---|---|
| Olivetti Faces (Roweis) | Yes | n = 400, p = 4096; k = 40 clusters; image data of human faces |
| Colon Cancer (Alon et al. 1999) | Yes | n = 62, p = 2000; k = 2 clusters; high dimensional data |
| Prostate Cancer (Singh et al. 2002) | Yes | n = 102, p = 6033; k = 2 clusters; high dimensional data |
| Spherical (Bien et al. 2011) | Yes | n = 300, p = 10; k = 3 clusters; spherical shape; L-1 and L-2 distance used |
| Elliptical (Bien et al. 2011) | Yes | n = 300, p = 10; k = 3 clusters; elliptical shape; L-1 and L-2 distance used |
| Outlier (Bien et al. 2011) | Yes | n = 300, p = 10; k = 3 clusters; spherical shape with outliers; L-1 and L-2 distance used |
| Iris (Anderson, 1936; Fisher, 1936) | No | n = 150, p = 4; k = 3 clusters; elliptical shape; well-separated clusters |
| NBIDE (Vorburger et al. 2007) | No | n = 144, p = 144,000; k = 12 clusters; image data of cartridge cases |
| FBI S&W | | |
| Linkage type | Max minimax radius | Misclassification | Precision | Recall |
|---|---|---|---|---|
| single | 3394.93 | 0.40 | 0.04 | 0.78 |
| complete | 2606.25 | 0.04 | 0.31 | 0.49 |
| average | 2449.69 | 0.07 | 0.18 | 0.60 |
| centroid | 3259.74 | 0.79 | 0.02 | 0.83 |
| minimax | 2293.45 | 0.05 | 0.24 | 0.57 |
| Data set (k = truth) | Linkage type | Max minimax radius | Misclassification | Precision | Recall |
|---|---|---|---|---|---|
| Olivetti Faces (k = 40) | single | 3394.93 | 0.40 | 0.04 | 0.78 |
| | complete | 2606.25 | 0.04 | 0.31 | 0.49 |
| | average | 2449.69 | 0.07 | 0.18 | 0.60 |
| | centroid | 3259.74 | 0.79 | 0.02 | 0.83 |
| | minimax | 2293.45 | 0.05 | 0.24 | 0.57 |
| Colon Cancer (k = 2) | single | 0.34 | 0.46 | 0.54 | 0.98 |
| | complete | 0.28 | 0.48 | 0.53 | 0.87 |
| | average | 0.28 | 0.48 | 0.53 | 0.87 |
| | centroid | 0.28 | 0.47 | 0.53 | 0.90 |
| | minimax | 0.29 | 0.48 | 0.53 | 0.92 |
| Prostate Cancer (k = 2) | single | 0.48 | 0.50 | 0.50 | 0.98 |
| | complete | 0.33 | 0.49 | 0.50 | 0.77 |
| | average | 0.35 | 0.49 | 0.50 | 0.73 |
| | centroid | 0.40 | 0.49 | 0.50 | 0.69 |
| | minimax | 0.35 | 0.49 | 0.50 | 0.76 |
| Spherical-L2 (k = 3) | single | 6.07 | 0.66 | 0.33 | 0.99 |
| | complete | 5.13 | 0.24 | 0.63 | 0.64 |
| | average | 5.95 | 0.66 | 0.33 | 0.98 |
| | centroid | 6.07 | 0.66 | 0.33 | 0.99 |
| | minimax | 5.35 | 0.25 | 0.62 | 0.65 |
| Spherical-L1 (k = 3) | single | 15.97 | 0.66 | 0.33 | 0.99 |
| | complete | 14.26 | 0.33 | 0.51 | 0.55 |
| | average | 15.72 | 0.66 | 0.33 | 0.98 |
| | centroid | 15.75 | 0.66 | 0.33 | 0.99 |
| | minimax | 14.87 | 0.33 | 0.51 | 0.51 |
| Elliptical-L2 (k = 3) | single | 6.79 | 0.66 | 0.33 | 0.99 |
| | complete | 5.95 | 0.35 | 0.48 | 0.51 |
| | average | 6.66 | 0.66 | 0.33 | 0.96 |
| | centroid | 6.76 | 0.66 | 0.33 | 0.99 |
| | minimax | 6.21 | 0.38 | 0.44 | 0.51 |
| Elliptical-L1 (k = 3) | single | 17.40 | 0.66 | 0.33 | 0.99 |
| | complete | 16.40 | 0.38 | 0.44 | 0.59 |
| | average | 17.40 | 0.66 | 0.33 | 0.96 |
| | centroid | 17.37 | 0.66 | 0.33 | 0.99 |
| | minimax | 15.60 | 0.33 | 0.50 | 0.57 |
| Outliers-L2 (k = 3) | single | 6.46 | 0.66 | 0.33 | 0.99 |
| | complete | 5.81 | 0.46 | 0.38 | 0.65 |
| | average | 6.12 | 0.65 | 0.33 | 0.95 |
| | centroid | 6.37 | 0.66 | 0.33 | 0.98 |
| | minimax | 5.95 | 0.39 | 0.44 | 0.65 |
| Outliers-L1 (k = 3) | single | 17.39 | 0.66 | 0.33 | 0.99 |
| | complete | 15.99 | 0.42 | 0.39 | 0.50 |
| | average | 16.37 | 0.66 | 0.33 | 0.97 |
| | centroid | 16.37 | 0.66 | 0.33 | 0.98 |
| | minimax | 14.79 | 0.26 | 0.60 | 0.61 |
| Data set (k = truth) | Linkage type | Max minimax radius | Misclassification | Precision | Recall |
|---|---|---|---|---|---|
| Iris (k = 3) | single | 2.97 | 0.23 | 0.59 | 0.99 |
| | complete | 2.19 | 0.20 | 0.67 | 0.79 |
| | average | 2.56 | 0.22 | 0.60 | 0.96 |
| | centroid | 2.97 | 0.23 | 0.59 | 0.99 |
| | minimax | 2.09 | 0.17 | 0.71 | 0.79 |
| NBIDE (k = 12) | single | 0.82 | 0.23 | 0.23 | 0.89 |
| | complete | 0.77 | 0.05 | 0.66 | 0.79 |
| | average | 0.75 | 0.03 | 0.77 | 0.91 |
| | centroid | 0.83 | 0.80 | 0.08 | 0.87 |
| | minimax | 0.73 | 0.02 | 0.84 | 0.92 |
| FBI S&W (k = 69) | single | 0.75 | 0.01 | 0.33 | 0.86 |
| | complete | 0.65 | 0.00 | 0.83 | 0.93 |
| | average | 0.63 | 0.00 | 0.77 | 0.91 |
| | centroid | 0.82 | 0.17 | 0.02 | 0.58 |
| | minimax | 0.59 | 0.00 | 0.70 | 0.90 |
Benchmarking Minimax Linkage
Xiao Hui Tai
Carnegie Mellon University
and
Kayla Frisoli
Carnegie Mellon University
Abstract.
Minimax linkage was first introduced by Ao et al. (Cheung et al., 2004) in 2004, as an alternative to standard linkage methods used in hierarchical clustering. Minimax linkage relies on distances to a prototype for each cluster; this prototype can be thought of as a representative object in the cluster, hence improving the interpretability of clustering results. Bien and Tibshirani analyzed properties of this method in 2011 (Bien and Tibshirani, 2011), popularizing the method within the statistics community. Additionally, they performed comparisons of minimax linkage to standard linkage methods, making use of five data sets and two different evaluation metrics (distance to prototype and misclassification rate). In an effort to expand upon their work and evaluate minimax linkage more comprehensively, our benchmark study focuses on thorough method evaluation via multiple performance metrics on several well-described data sets. We also make all code and data publicly available through an R package, for full reproducibility. Similarly to (Bien and Tibshirani, 2011), we find that minimax linkage often produces the smallest maximum minimax radius of all linkage methods, meaning that minimax linkage produces clusters where objects in a cluster are tightly clustered around their prototype. This is true across a range of values for the total number of clusters ($k$). However, this is not always the case, and special attention should be paid to the case when $k$ is the true known value. For true $k$, minimax linkage does not always perform the best in terms of all the evaluation metrics studied, including maximum minimax radius. This paper was motivated by the IFCS Cluster Benchmarking Task Force’s call for clustering benchmark studies and the white paper (Mechelen et al., 2018), which put forth guidelines and principles for comprehensive benchmarking in clustering. Our work is designed to be a neutral benchmark study of minimax linkage.
Keywords: Minimax linkage, hierarchical clustering, benchmark analysis
In submission to IFCS Cluster Benchmarking Challenge, 2019.
1. Introduction
Hierarchical agglomerative clustering involves successively grouping items within a dataset together, based on similarity of the items. The algorithm finishes once all items have been linked, resulting in a hierarchical group similarity structure. Given that two items are merged together, we must determine how similar that merged group is to the remaining items (or groups of items). In other words, we have to recalculate the dissimilarity between any merged points. This dissimilarity between groups can be defined in many ways, and these are known as linkage methods. Standard, established linkage methods include single, complete, average and centroid linkage. Minimax linkage, which was first introduced in (Cheung et al., 2004) and formally analyzed in (Bien and Tibshirani, 2011), will be the subject of our evaluation. We describe hierarchical agglomerative clustering and the linkage methods precisely as follows.
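Given items $1, \dots, n$, dissimilarities $d_{ij}$ between each pair $i$ and $j$, and dissimilarities $d(G, H)$ between groups $G$ and $H$, hierarchical agglomerative clustering starts with each node in its own group, and repeatedly merges groups such that $d(G, H)$ is within a threshold $D$. $d(G, H)$ is determined by the linkage method, defined as follows:
- Single linkage $d_{\text{single}}(G, H) = \min_{i \in G, j \in H} d_{ij}$. The distance between two clusters is defined as the distance between the closest points across the clusters.
- Complete linkage $d_{\text{complete}}(G, H) = \max_{i \in G, j \in H} d_{ij}$. The distance between two clusters is defined as the distance between the farthest points across the clusters.
- Average linkage $d_{\text{average}}(G, H) = \frac{1}{|G||H|} \sum_{i \in G, j \in H} d_{ij}$. The distance between two clusters is defined as the average of the distances across all pairs of points across the clusters.
- Centroid linkage $d_{\text{centroid}}(G, H) = d(\bar{x}_G, \bar{x}_H)$, where $\bar{x}_G = \frac{1}{|G|} \sum_{i \in G} x_i$ and $\bar{x}_H = \frac{1}{|H|} \sum_{j \in H} x_j$, with $x_i$ the feature vector of item $i$. The distance between two clusters is defined as the distance between the centroid (mean) of the points within the first cluster and the centroid (mean) of the points in the second cluster. Often, the centroids have no intuitive interpretation (e.g., when items are text or images).
- Minimax linkage $d_{\text{minimax}}(G, H) = \min_{i \in G \cup H} r(i, G \cup H)$, where $r(i, G) = \max_{j \in G} d_{ij}$, the radius of a group of nodes $G$ around $i$. Informally, each point $i$ belongs to a cluster whose center $c$ satisfies $d_{ic} \le D$.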
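The five linkage definitions above can be sketched in a few lines of Python. This is a minimal illustration and not the authors' implementation: the function names, the toy one-dimensional points, and the use of Euclidean distance for the centroid case are all assumptions made for the example.

```python
import math
from itertools import product

def single(d, G, H):
    """Smallest pairwise dissimilarity across the two groups."""
    return min(d[i][j] for i, j in product(G, H))

def complete(d, G, H):
    """Largest pairwise dissimilarity across the two groups."""
    return max(d[i][j] for i, j in product(G, H))

def average(d, G, H):
    """Mean of all pairwise dissimilarities across the two groups."""
    return sum(d[i][j] for i, j in product(G, H)) / (len(G) * len(H))

def centroid(points, G, H):
    """Euclidean distance between group means (needs coordinates, not just d)."""
    def mean(group):
        return [sum(points[i][c] for i in group) / len(group)
                for c in range(len(points[0]))]
    return math.dist(mean(G), mean(H))

def radius(d, i, G):
    """r(i, G): distance from item i to the farthest item in G."""
    return max(d[i][j] for j in G)

def minimax(d, G, H):
    """Smallest radius achievable by any item of the merged group."""
    union = list(G) + list(H)
    return min(radius(d, i, union) for i in union)

# Toy data (hypothetical): four points on a line.
points = [(0.0,), (1.0,), (2.0,), (10.0,)]
d = [[math.dist(a, b) for b in points] for a in points]
G, H = [0, 1], [2, 3]

print(single(d, G, H), complete(d, G, H), average(d, G, H),
      centroid(points, G, H), minimax(d, G, H))
```

On this toy input the merged group is {0, 1, 2, 10}; the item at 2 has the smallest radius (8), so it would serve as the prototype of the merged cluster, whereas the centroid (5.5 away from the other group's mean) does not correspond to any observed item.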
Bien and Tibshirani (Bien and Tibshirani, 2011) expand upon Ao et al. (Cheung et al., 2004) by providing a more comprehensive evaluation of minimax linkage. In particular, they compare minimax linkage to the standard linkage methods using five data sets and two different evaluation metrics. Additionally (although not the focus of the current paper), the authors prove several theoretical properties, for example that dendrograms produced by minimax linkage cannot have inversions and are robust to some data perturbations. They also perform additional evaluations, compare prototypes to centroids, and benchmark computational speed.
The comparisons of minimax linkage to standard linkage methods in (Bien and Tibshirani, 2011) are summarized in Table 1. For the colon and prostate cancer data sets, distance to prototype was calculated for minimax linkage, but not for the other linkage methods, since those two data sets were used to compare prototypes to centroids, rather than compare the different linkage methods. More details on the data sets and metrics used are in Section 2.
“Benchmarking in cluster analysis: A white paper” (Mechelen et al., 2018) makes multiple recommendations for analyses of clustering methods. We focus on those for data sets and evaluation metrics.
The first recommendation in (Mechelen et al., 2018) with respect to choosing data sets is to “make a suitable choice of data sets and give an explicit justification of the choices made.” This was not done thoroughly in the original Bien and Tibshirani paper. It was not explained why the particular data sets were chosen for the different evaluations, and features of the data sets were not fully described. In our study, we both add additional data sets and justify existing ones (which include both synthetic and empirical data) in Section 2.2.
With respect to evaluation metrics, (Mechelen et al., 2018) recommends that we think carefully about criteria used and justify our choices. They also recommend that we “consider multiple criteria if appropriate.” Additionally, criteria should be applied across all data sets, and this is one of our main critiques of the existing evaluation, where not all of the data sets used were evaluated on all the criteria suggested (in Table 1, all cells should be “Yes”).
Distance to prototype was well-justified (this is the crux of minimax linkage), but misclassification rate was not. While interpretable cluster representatives are important, a researcher may also care about how accurately the algorithm classifies the items in the data set. That being said, when there are a large number of small clusters, the misclassification rate might not be the best measure of performance. In such cases, when working with pairwise comparisons, there is often a large class imbalance problem; most pairs of items do not truly match. A method could achieve a very low misclassification rate simply by predicting all pairs to be non-matches. Therefore we chose to include precision and recall as additional metrics to evaluate clustering quality.
Finally, a suggestion of (Mechelen et al., 2018) is to fully disclose data and code. Unlike the original paper, we supply the code and data that accompany this paper, for full reproducibility. We have also written an R package, clusterTruster, available on GitHub (https://github.com/xhtai/clusterTruster), which lets users run additional evaluations on their own data.
This paper is designed to be a neutral benchmark study of minimax linkage, and the specific contributions are:
- (1) An evaluation of all data sets on all of the criteria in (Bien and Tibshirani, 2011)
- (2) A better assessment of performance with the utilization of precision and recall
- (3) An evaluation on additional (diverse) data sets not in (Bien and Tibshirani, 2011)
- (4) Providing publicly available code and an R package that allow for full reproducibility and transparency, while simplifying the process of making additional evaluations on user-supplied data.
The rest of the paper is organized as follows. Section 2 describes our benchmark study, including justifications for the data sets and evaluation metrics used. Section 3 presents the results, and Section 4 concludes.
2. Benchmark Study
In this benchmark study we both introduce new evaluation metrics (which we apply to every data set), and add new data sets to provide for a more comprehensive analysis. These are detailed as follows.
2.1. Evaluation metrics
Our first improvement to (Bien and Tibshirani, 2011) is to utilize all evaluation metrics provided in the paper on all of the data sets (as opposed to some metrics on some data sets). In other words, any instance of “Not quite” or “No” within Table 1 should be changed to “Yes.” Additionally, we introduce precision and recall as additional evaluation metrics. The evaluation metrics used are described as follows.
Distance to prototype
The distance to prototype is measured by the maximum minimax radius. The radius of a group of nodes $G$ around $i$ was defined in Section 1 as $r(i, G) = \max_{j \in G} d_{ij}$. This is the distance of the farthest point in cluster $G$ from point $i$.
The prototype is selected to be the point in $G$ with the minimum radius, and this radius is known as the minimax radius,
$$r(G) = \min_{i \in G} r(i, G).$$
(Using this notation, minimax linkage between two clusters $G$ and $H$ can also be written as $d(G, H) = r(G \cup H)$.)
Now, for a clustering with $k$ clusters $G_1, \dots, G_k$, each of the clusters is associated with a minimax radius, $r(G_l)$. We consider the maximum minimax radius, $\max_{l = 1, \dots, k} r(G_l)$, in other words the “worst” minimax radius across all clusters. In this sense, the maximum minimax radius can be thought of as a measure of the tightness of clusters around their prototype. A small value indicates that points within the cluster are close to their prototypes, meaning that the prototype is an accurate representation of points within the cluster.
Minimax linkage relies on successively merging clusters to produce the smallest maximum radius of the resulting cluster, so we would expect minimax linkage to perform the best among other linkage methods in terms of producing the smallest maximum minimax radii.
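As a rough illustration of the distance-to-prototype metric, the following Python sketch picks the prototype of each cluster and reports the maximum minimax radius of a clustering. The function names and the toy points are hypothetical and chosen only for this example; ties between candidate prototypes are broken by taking the first one.

```python
def minimax_radius(d, G):
    """Return (prototype, radius): the item of G whose farthest
    neighbour in G is closest, and that farthest distance."""
    prototype = min(G, key=lambda i: max(d[i][j] for j in G))
    return prototype, max(d[prototype][j] for j in G)

def max_minimax_radius(d, clusters):
    """The 'worst' (largest) minimax radius over all clusters."""
    return max(minimax_radius(d, G)[1] for G in clusters)

# Toy data (hypothetical): five points on a line, two candidate clusterings.
pts = [0.0, 1.0, 2.0, 10.0, 11.0]
d = [[abs(a - b) for b in pts] for a in pts]

tight = [[0, 1, 2], [3, 4]]   # groups nearby points together
loose = [[0, 1], [2, 3, 4]]   # puts the point at 2 with the far group

print(minimax_radius(d, tight[0]))    # prototype index and radius of the first cluster
print(max_minimax_radius(d, tight))
print(max_minimax_radius(d, loose))
```

The tight clustering has a maximum minimax radius of 1, while the loose one has 8, because the cluster {2, 10, 11} has no item that is close to all of its members.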
Misclassification rate
The misclassification rate is defined as the proportion of misclassified examples out of all the examples.
$$\text{Misclassification rate} = \frac{\text{number of misclassified examples}}{\text{total number of examples}}$$
In the clustering context, misclassification rate is defined on pairs of items. Specifically, we consider each of the $\binom{n}{2}$ pairs, where $n$ is the number of individual items, and the outcome of interest is whether the pair is predicted to be in the same cluster or not. A pair is misclassified if the clustering method predicts that the pair is in the same cluster when the true clustering says they are not, or vice versa.
A low misclassification rate typically indicates high accuracy (a good classifier). However, in cases with a large class imbalance (typically many non-matches and few matches), we need to be careful when using misclassification rates, because simply classifying all pairs as non-matches produces a very low misclassification rate.
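The pairwise definition and the class imbalance caveat can be checked with a short Python sketch. The function name is hypothetical; the label layout below only mimics the structure described for the FBI S&W data (69 true clusters of 2 items each) and does not use the actual data.

```python
from itertools import combinations

def pairwise_misclassification(pred, truth):
    """Share of item pairs whose same-cluster status differs
    between the predicted and the true clustering."""
    pairs = list(combinations(range(len(truth)), 2))
    wrong = sum((pred[i] == pred[j]) != (truth[i] == truth[j]) for i, j in pairs)
    return wrong / len(pairs)

# 138 items in 69 true clusters of size 2 (structure only, no real data).
truth = [i // 2 for i in range(138)]

# A degenerate "clustering" that puts every item in its own cluster,
# i.e. predicts every pair to be a non-match.
all_singletons = list(range(138))

print(pairwise_misclassification(all_singletons, truth))
```

Only 69 of the 9453 pairs are true matches, so the degenerate prediction misclassifies fewer than 1% of pairs even though it recovers no cluster at all.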
Precision and recall
To take into account class imbalance, we use the evaluation metrics precision and recall. A typical confusion matrix is below.
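| | Truth: match | Truth: non-match |
|---|---|---|
| Predicted: match | True positive (TP) | False positive (FP) |
| Predicted: non-match | False negative (FN) | True negative (TN) |
$$\text{Precision} = \frac{TP}{TP + FP}$$
$$\text{Recall} = \frac{TP}{TP + FN}$$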
Neither precision nor recall includes the true negative cell in its calculation, so both produce fairer estimates of accuracy in class-imbalanced data sets, which are common in clustering. Precision and recall both have a maximum value of 1, and a good classifier should have high precision and high recall.
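A minimal Python sketch of pairwise precision and recall follows, assuming that a "positive" is a pair placed in the same cluster. The function name and the toy labels are hypothetical.

```python
from itertools import combinations

def pairwise_precision_recall(pred, truth):
    """Precision and recall over item pairs; a positive is a pair
    assigned to the same cluster."""
    tp = fp = fn = 0
    for i, j in combinations(range(len(truth)), 2):
        same_pred = pred[i] == pred[j]
        same_true = truth[i] == truth[j]
        if same_pred and same_true:
            tp += 1
        elif same_pred:
            fp += 1
        elif same_true:
            fn += 1
    precision = tp / (tp + fp) if tp + fp else float("nan")
    recall = tp / (tp + fn) if tp + fn else float("nan")
    return precision, recall

truth = [0, 0, 0, 1, 1, 1]
pred = [0, 0, 1, 1, 1, 1]        # one item assigned to the wrong cluster
one_big_cluster = [0] * 6        # everything lumped together

print(pairwise_precision_recall(pred, truth))
print(pairwise_precision_recall(one_big_cluster, truth))
```

Lumping all items into one cluster gives perfect recall but low precision (0.4 here), which is one possible reading of result rows that pair a recall near 1 with a low precision.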
All $k$, best $k$ vs true $k$
Again, define $k$ as the number of clusters in the clustering. In (Bien and Tibshirani, 2011), evaluation for distance from prototype was conducted over all possible values of $k$ (specifically, in a data set of $n$ items, $k = 1, \dots, n$). Misclassification rate, however, was reported for the best $k$, meaning the lowest misclassification rate over all $k$, and the true $k$, where the ground truth clustering is known.
In this paper we evaluate on all metrics using all $k$, and also report the metrics for true $k$. It is possible to derive measures for the best $k$, but due to the large number of data sets and evaluation metrics used, this became somewhat intractable and was not pursued further, but can be a subject of future work.
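Evaluating a metric over all numbers of clusters can be sketched by recording the partition after every merge of an agglomerative run. The code below is a deliberately naive illustration with minimax linkage on toy data; it is not the algorithm used in the paper or in any package, and all names are hypothetical.

```python
def minimax_radius(d, G):
    """Smallest 'farthest-neighbour' distance over candidate prototypes in G."""
    return min(max(d[i][j] for j in G) for i in G)

def minimax_linkage(d, G, H):
    """Minimax linkage: minimax radius of the merged group."""
    return minimax_radius(d, G + H)

def agglomerate(d, linkage):
    """Yield the partition at every level, from n clusters down to 1.
    Naive search over all cluster pairs at each step (illustration only)."""
    clusters = [[i] for i in range(len(d))]
    yield [list(c) for c in clusters]
    while len(clusters) > 1:
        candidates = ((a, b) for a in range(len(clusters))
                      for b in range(a + 1, len(clusters)))
        a, b = min(candidates,
                   key=lambda ab: linkage(d, clusters[ab[0]], clusters[ab[1]]))
        merged = clusters[a] + clusters[b]
        clusters = [c for idx, c in enumerate(clusters) if idx not in (a, b)]
        clusters.append(merged)
        yield [list(c) for c in clusters]

# Toy data (hypothetical): two well-separated groups on a line.
pts = [0.0, 1.0, 2.0, 10.0, 11.0, 12.0]
d = [[abs(a - b) for b in pts] for a in pts]

# Maximum minimax radius for every number of clusters.
curve = {len(part): max(minimax_radius(d, G) for G in part)
         for part in agglomerate(d, minimax_linkage)}
for n_clusters in sorted(curve):
    print(n_clusters, curve[n_clusters])
```

Other pairwise metrics could be added to the same loop by comparing each recorded partition with the true labels; the value at the true number of clusters is then one point on the resulting curve.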
2.2. Data sets
In terms of the data sets considered, we use all of the data used in (Bien and Tibshirani, 2011) (except for Grolier Encyclopedia), and introduce additional data sets that exhibit a wider range of data attributes. These additional data sets were also included to ensure that those used in (Bien and Tibshirani, 2011) were not deliberately selected to produce desired results. The Grolier Encyclopedia data set does not include true clusters and was therefore not included in the current paper. Brief descriptions are as follows, and more details for many of the data sets can be found in (Bien and Tibshirani, 2011). A summary of the data is in Table 2.
Olivetti Faces This data set contains 400 images of 64 × 64 pixels. There are 10 images each from 40 people. The pairwise distance measure used is distance. Here we use the data from the RnavGraphImageData package in R.
Colon Cancer The Colon Cancer data set contains gene expression levels for 2000 genes for 62 patients, 40 with cancer and 22 healthy. The pairwise distance measure used is correlation. Here we use the data from the HiDimDA package in R.
Prostate Cancer The Prostate Cancer data contains gene expression levels for 6033 genes for 102 patients, 52 with cancer and 50 healthy. The pairwise distance measure used is correlation. There are multiple versions of the data available online and in R packages. The version we use is from https://stat.ethz.ch/~dettling/bagboost.html, and our results match the resulting plots produced in (Bien and Tibshirani, 2011).
Simulations We repeat the simulations done in (Bien and Tibshirani, 2011). These involve three sets of data: spherical, elliptical and outliers. Each data set has 3 clusters of 100 points each in $\mathbb{R}^{10}$. Both $\ell_1$ and $\ell_2$ distances are used as pairwise distance measures. In (Bien and Tibshirani, 2011) simulations were run 50 times each, but here we only ran each once. In future analyses it is possible to perform more runs.
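To make the set-up concrete, here is a hedged Python sketch of a spherical simulation with 3 clusters of 100 points in 10 dimensions, together with both distance measures. The cluster centres, the spread, the separation and the seed are assumptions made for illustration; the actual simulation parameters are those given in (Bien and Tibshirani, 2011) and are not reproduced here.

```python
import math
import random

def simulate_spherical(n_per_cluster=100, p=10, n_clusters=3,
                       spread=1.0, separation=3.0, seed=0):
    """Draw Gaussian clusters around random centres.
    spread, separation and seed are illustrative assumptions."""
    rng = random.Random(seed)
    points, labels = [], []
    for c in range(n_clusters):
        center = [rng.gauss(0, separation) for _ in range(p)]
        for _ in range(n_per_cluster):
            points.append([m + rng.gauss(0, spread) for m in center])
            labels.append(c)
    return points, labels

def l1(x, y):
    """Sum of absolute coordinate differences."""
    return sum(abs(a - b) for a, b in zip(x, y))

def l2(x, y):
    """Euclidean distance."""
    return math.dist(x, y)

def distance_matrix(points, dist):
    """Full pairwise dissimilarity matrix under the given distance."""
    return [[dist(a, b) for b in points] for a in points]

points, labels = simulate_spherical()
d1 = distance_matrix(points, l1)
d2 = distance_matrix(points, l2)
print(len(points), len(points[0]), d1[0][1] >= d2[0][1])
```

The two matrices can be passed to the same clustering routine, which is why each simulated data set appears twice in the results, once per distance.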
Iris The iris data set (Anderson, 1936; Fisher, 1936) is pre-loaded in R and has been used extensively as an example data set in various applications, including clustering. It contains 50 flowers from each of 3 species. There are four features for each observation: sepal length and width, and petal length and width. Here we simply scale and center the features and use distance as a pairwise distance measure.
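Centering and scaling can be sketched as follows, assuming that scaling means dividing by the sample standard deviation of each feature. The rows below are made-up values with four features, not actual iris measurements, and the function name is hypothetical.

```python
import statistics

def standardize(rows):
    """Center each column at 0 and scale it to unit sample standard deviation."""
    columns = list(zip(*rows))
    means = [statistics.mean(col) for col in columns]
    sds = [statistics.stdev(col) for col in columns]   # n - 1 denominator
    return [[(v - m) / s for v, m, s in zip(row, means, sds)] for row in rows]

# Made-up rows with four features each (illustration only).
rows = [
    [5.0, 3.5, 1.5, 0.2],
    [6.5, 3.0, 4.5, 1.5],
    [7.0, 3.2, 6.0, 2.2],
    [5.5, 2.5, 4.0, 1.3],
]
scaled = standardize(rows)
for row in scaled:
    print([round(v, 3) for v in row])
```

After this step every feature contributes on a comparable scale when pairwise distances are computed.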
NBIDE and FBI S&W The National Institute of Standards and Technology (NIST) maintains the Ballistics Toolmark Research Database (https://tsapps.nist.gov/NRBTD), containing images of cartridge cases from test fires of various firearms. These images are 3D topographies, meaning that surface depth is recorded at each pixel location. Each image is approximately 1200 × 1200 pixels. We use data from two different data sets, NIST Ballistics Imaging Database Evaluation (NBIDE) (Vorburger et al., 2007) and FBI Smith & Wesson. The former contains 12 images each from 12 different firearms, and the latter contains 2 images each from 69 different firearms. We have pre-processed and aligned these images using the R package cartridges3D (available at https://github.com/xhtai/cartridges3D), and extracted a correlation between each pair of images. The resulting pairwise comparison data are available in the clusterTruster package.
References
- Anderson (1936) Edgar Anderson. 1936. The Species Problem in Iris. Annals of the Missouri Botanical Garden 23, 3 (1936), 457–509. http://www.jstor.org/stable/2394164
- Bien and Tibshirani (2011) J. Bien and R. Tibshirani. 2011. Hierarchical Clustering With Prototypes via Minimax Linkage. J. Am. Stat. Assoc. 106, 495 (2011), 1075–1084.
- Cheung et al. (2004) David Cheung, Ian Melhado, Kevin Yip, Michael Ng, Pak C. Sham, Pui-Yee Fong, and S. I. Ao. 2004. CLUSTAG: hierarchical clustering and graph methods for selecting tag SNPs. Bioinformatics 21, 8 (12 2004), 1735–1736. https://doi.org/10.1093/bioinformatics/bti201
- Fisher (1936) R. A. Fisher. 1936. The Use of Multiple Measurements in Taxonomic Problems. Annals of Eugenics 7, 2 (1936), 179–188. https://doi.org/10.1111/j.1469-1809.1936.tb02137.x
- Mechelen et al. (2018) Iven Van Mechelen, Anne-Laure Boulesteix, Rainer Dangl, Nema Dean, Isabelle Guyon, Christian Hennig, Friedrich Leisch, and Douglas Steinley. 2018. Benchmarking in cluster analysis: A white paper. arXiv:1809.10496
- Vorburger et al. (2007) T. Vorburger, J. Yen, B. Bachrach, T. Renegar, J. Filliben, L. Ma, H. Rhee, A. Zheng, J. Song, M. Riley, C. Foreman, and S. Ballou. 2007. Surface topography analysis for a feasibility assessment of a National Ballistics Imaging Database. Technical Report NISTIR 7362. National Institute of Standards and Technology, Gaithersburg, MD.
