A close-up comparison of the misclassification error distance and the adjusted Rand index for external clustering evaluation
José E. Chacón

TL;DR
This paper compares the misclassification error distance and the adjusted Rand index to understand their differences, properties, and what they measure in clustering evaluation through theoretical analysis and simulations.
Contribution
It provides a detailed comparison of two popular clustering evaluation metrics, clarifying their properties and correcting misconceptions.
Findings
The two criteria measure different aspects of clustering quality.
Simulation results reveal distributional differences and biases.
The study clarifies the interpretation of each metric in practice.
Abstract
The misclassification error distance and the adjusted Rand index are two of the most commonly used criteria to evaluate the performance of clustering algorithms. This paper provides an in-depth comparison of the two criteria, aimed to better understand exactly what they measure, their properties and their differences. Starting from their population origins, the investigation includes many data analysis examples and the study of particular cases in great detail. An exhaustive simulation study allows inspecting the criteria distributions and reveals some previous misconceptions.
Table 1

| True labels | Data-based label 1 | Data-based label 2 | Data-based label 3 |
|---|---|---|---|
| Setosa | 50 | 0 | 0 |
| Versicolor | 0 | 48 | 2 |
| Virginica | 0 | 1 | 49 |

Table 2

| True labels | modclust 1 | modclust 2 | modclust 3 | entmerge 1 | entmerge 2 | entmerge 3 | entmerge 4 | entmerge 5 |
|---|---|---|---|---|---|---|---|---|
| | 47 | 197 | 7 | 16 | 7 | 0 | 14 | 214 |
| | 0 | 1408 | 153 | 0 | 146 | 929 | 417 | 69 |
| | 0 | 278 | 1216 | 0 | 1191 | 81 | 63 | 159 |
| | 0 | 62 | 0 | 0 | 0 | 0 | 0 | 62 |
| | 4813 | 2 | 0 | 4809 | 0 | 0 | 1 | 5 |

Table 3

| Clustering (rows) vs. clustering (columns) | | | | | |
|---|---|---|---|---|---|
| | 1 | 0 | 1 | 1 | 0 |
| | 0 | 1 | 0 | 0 | 1 |
| | 1 | 0 | 1 | 0 | 1 |
| | 0 | 1 | 0 | 1 | 0 |
| | 1 | 0 | 1 | 0 | 1 |
José E. Chacón (Departamento de Matemáticas, Universidad de Extremadura, E-06006 Badajoz, Spain. E-mail: [email protected])
1 Introduction
The adjusted Rand index (ARI) introduced in Hubert and Arabie (1985) is one of the most commonly used measures of performance for clustering evaluation. Indeed, it was the recommended choice in the seminal paper of Milligan and Cooper (1986), where five criteria were examined regarding the task of comparison of hierarchical clustering algorithms across different hierarchy levels. Their recommendation is based on the fact that, for the null case data (i.e., for a synthetic sample with randomly assigned class labels, showing no significant cluster structure), the ARI was the only index that produced a flat response curve across hierarchy levels, with mean values close to zero, hence indicating that the agreement between the randomly assigned labels and the algorithm solution was due to chance.
Another popular measure for clustering validation, not included in Milligan and Cooper's study, is the misclassification error distance (MED). Its first appearance in the literature dates back at least to Régnier (1965), where it was introduced as a distance between partitions of a finite set, and it was called transfer distance. It is also referred to as partition distance (Gusfield, 2002) or maximum matching distance (Rossi, 2015). Many papers concerning clustering evaluation indeed contain detailed comparisons of both the ARI and the MED, showing arguments in favour of one or the other; see, for instance, Steinley (2003, 2004), Denœud and Guénoche (2006) or Meilă (2005, 2007, 2016).
Whereas Steinley (2004) supports Milligan and Cooper's recommendation by inspecting the performance of the ARI and the MED in an exhaustive simulation study, other authors favour the MED: Meilă (2016) suggests that the MED “comes closest to satisfying everyone” in terms of its properties and ease of interpretation; Denœud and Guénoche (2006) suggest that the MED is more appropriate for small sample sizes, based on their study of all the clusterings at a close number of transfers from a given one; and von Luxburg (2010) considers the MED “the most convenient choice from a theoretical point of view”.
It must be stressed that both criteria are commonly categorized as “external”, in the sense that they are used to measure the performance of a data-based clustering algorithm against a true cluster structure, known in advance in a simulation scenario or after a data inspection by an expert, which is taken as the ideal clustering solution, but is external to the clustering methodology itself. Internal criteria (such as those based on cohesion, entropy, cluster separation, etc.) are also frequently used, but they will not constitute the focus of this paper; see Hennig (2019) for a thorough review of internal cluster validation indexes.
This paper aims to provide further comparisons between the MED and the ARI, at several levels. Indeed, many other external criteria could be considered as well, and they are also reviewed in the aforementioned comparative studies, but here the discussion is restricted to the former two because they are usually recognized as the main criteria used in practice. The close-up inspection examines a wide range of features: Section 2 first glances through their population origins (i.e., their counterparts in the case where the true underlying data distribution is fully known) and then elaborates on their traditional, and more common, data-based versions. The comparison of these empirical analogues is the subject of Sections 3 (theoretically) and 4 (by simulations). The theoretical study comprises their computation, some illustrations by means of simple examples, and an analysis of their extreme values in relation to the case of independent clusterings. The simulation scenarios investigate the distributions of the criteria in the null case and how they evolve as two clusterings become apart from perfect agreement. Finally, Section 5 discusses the new findings and their implications.
2 Population and empirical distances between clusterings
2.1 The population version of cluster analysis
Cluster analysis is mostly posed as a sample problem, and perhaps that is one of the reasons why many authors have called attention to the lack of theoretical results for clustering (Milligan, 1996; von Luxburg and Ben-David, 2005), as opposed to regression or classification, where the population background is much more clearly established.
Traditionally, the goal of clustering techniques is to provide a partitioning of a data set into groups. For that goal, it suffices to have an algorithm which is appropriate for the data set at hand. However, from a statistical perspective, such a given data set is not simply a set of points in the space, but a sample from some probability distribution $P$. Hence, the goal of clustering methodologies cannot reduce to partitioning only the data set at hand, but it must provide a mechanism to assign group labels to any point in the space; or, at least, to all the points in the sample space, since they could have been equally drawn as sample points. Such a view of clustering is shared by many authors, including Györfi et al. (2002, p. 245), Ben-David, von Luxburg and Pál (2006), Klemelä (2009, p. 196), Chacón (2015) or Wasserman (2018, Section 2.3).
Hence, the object that clustering algorithms should produce is not only a partition of the data set, but a whole-space partition. This means that if $\mathcal X$ denotes the sample space, a whole-space clustering is a class of sets $\mathcal C=\{C_1,\dots,C_r\}$ such that $C_i\cap C_j=\varnothing$ for all $i\neq j$ and $C_1\cup\dots\cup C_r=\mathcal X$. Indeed, most existing clustering methodologies are able to produce this type of object; this is the case, for instance, for $k$-means clustering, modal clustering or mixture model clustering (see Chacón, 2015). Obviously, any partition of $\mathcal X$ induces a partition of the observed data set as well. To avoid confusion, these are referred to as a whole-space clustering and a clustering of the data, respectively. Also, note that both objects can have a population version (the partition that would be made if the true underlying distribution were fully known) and a data-based version (the partition that would be made after observing the data).
That made clear, to evaluate the performance of clustering methods from a statistical point of view it is necessary to employ a distance between whole-space clusterings. While there exist many notions of distance between partitions of a finite set (Day, 1981; Meilă, 2016), proposals to serve as a distance between whole-space clusterings do not abound in the literature. Two of them are described next.
First, since the parts of a clustering (i.e., the clusters) are sets, it seems natural for distances between clusterings to be built upon a notion of discrepancy between sets. A usual way to express the discrepancy between two sets $C$ and $D$ is by quantifying the content of their symmetric difference $C\triangle D$. This difference is defined as the elements that $C$ and $D$ do not have in common; that is, $C\triangle D=(C\cap D^c)\cup(C^c\cap D)$. Then, taking into account the distinctive features of a partition, this natural distance between sets can be extended to define a distance between two clusterings $\mathcal C=\{C_1,\dots,C_r\}$ and $\mathcal D=\{D_1,\dots,D_s\}$, by adding up the contributions of the regions that their most similar clusters do not have in common. Specifically, Chacón (2015) defined the distance in measure between $\mathcal C$ and $\mathcal D$ as

$$d_P(\mathcal C,\mathcal D)=\frac12\min_{\sigma\in\mathcal P_s}\sum_{i=1}^{s}P\big(C_i\triangle D_{\sigma(i)}\big),\qquad(1)$$

where $\mathcal P_s$ is the set of all permutations of $s$ elements and, without loss of generality, it is assumed that $r\le s$, so that $\mathcal C$ would be enlarged by adding empty sets $C_{r+1}=\dots=C_s=\varnothing$ if necessary. More intuitively, $d_P(\mathcal C,\mathcal D)$ represents the minimum probability mass that needs to be moved (or re-labeled) to transform $\mathcal C$ into $\mathcal D$, or vice versa.
The distance in measure above is a clustering distance, in the sense of Ben-David, von Luxburg and Pál (2006, Definition 3). Nevertheless, these authors considered a different distance between whole-space clusterings, $d_H$, which they called Hamming distance. This second distance is more closely related to the Rand index (as detailed below), since it is defined as the probability that two independent random observations (drawn from $P$) belong to the same cluster with respect to one of the clusterings and to different clusters with respect to the other clustering. Hence, it can be shown that an explicit expression for this Hamming distance is

$$d_H(\mathcal C,\mathcal D)=\sum_{i=1}^{r}P(C_i)^2+\sum_{j=1}^{s}P(D_j)^2-2\sum_{i=1}^{r}\sum_{j=1}^{s}P(C_i\cap D_j)^2.\qquad(2)$$
The dependence of this measure on squared probabilities may appear somehow unnatural, but it is a consequence of the fact that it is based on comparing the cluster labels of pairs of points.
2.2 Comparing two clusterings of the data
In a simulation setting, where the true underlying distribution $P$ is fully known, it is possible to compute the ideal population clustering; that is, the whole-space partition that would be made on the basis of this knowledge of $P$ (this ideal partition varies from one methodology to another, depending on the notion of cluster that they seek). Hence, it is natural to evaluate the performance of a clustering technique by means of the distance from the produced data-based clustering to its population counterpart. Since both are clusterings of the whole space, any of the previously mentioned distances between whole-space clusterings can be employed.
Of course, things are different when dealing with real data. Suppose that we have observed data points $\mathcal X_n=\{x_1,\dots,x_n\}$. Even if the usual methods are able to produce whole-space clusterings with the sole information provided by $\mathcal X_n$, the fact that a clustering distance depends on $P$ (Ben-David, von Luxburg and Pál, 2006), which is unknown for real data sets, implies that to compute the clustering distance in practice it is necessary to replace $P$ by the empirical distribution $P_n$, which assigns probability mass $1/n$ to each data point. This means that only the labels of the data points are used in the comparison between the two clusterings, so that a distance between whole-space clusterings becomes in fact a distance between two clusterings of the data.
When this reasoning is applied to the two distances in the previous section, it results in two well-known distances between partitions of a finite set. To see this, given two partitions $\mathcal C=\{C_1,\dots,C_r\}$ and $\mathcal D=\{D_1,\dots,D_s\}$ of $\mathcal X_n$, with $r\le s$, denote by $n_{ij}$, $n_{i+}$ and $n_{+j}$ the cardinalities of $C_i\cap D_j$, $C_i$ and $D_j$, respectively. The $r\times s$-matrix $N=(n_{ij})$ is known as the confusion matrix (or contingency table), and the vectors $(n_{1+},\dots,n_{r+})$ and $(n_{+1},\dots,n_{+s})$ constitute its row-wise and column-wise margins, respectively. Then, taking into account that $P_n(C_i\triangle D_j)=(n_{i+}+n_{+j}-2n_{ij})/n$, it follows that the empirical version of the distance in measure (1) is

$${\rm MED}(\mathcal C,\mathcal D)=1-\frac1n\max_{\sigma\in\mathcal P_s}\sum_{i=1}^{r}n_{i\sigma(i)},$$

which coincides with the definition of the misclassification error distance (see Meilă, 2005), so that it will be denoted as ${\rm MED}(\mathcal C,\mathcal D)$ henceforth (or simply MED, if it is obvious which clusterings are being compared). The MED inherits from its population version a clear interpretation as the minimum proportion of data points that would need to be re-labeled so that $\mathcal C$ and $\mathcal D$ coincided, and that is why it is also known as transfer distance (Régnier, 1965).
On the other hand, the empirical equivalent of the Hamming distance (2) is

$$\widehat d_H(\mathcal C,\mathcal D)=\frac1{n^2}\Big\{\sum_{i=1}^{r}n_{i+}^2+\sum_{j=1}^{s}n_{+j}^2-2\sum_{i=1}^{r}\sum_{j=1}^{s}n_{ij}^2\Big\},\qquad(3)$$

which is also known as equivalence mismatch coefficient (Mirkin and Chernyi, 1970; Mirkin, 1996, p. 241) or as $n$-invariant Mirkin metric (Meilă, 2016). Being a sample equivalent of (2), $\widehat d_H(\mathcal C,\mathcal D)$ equals the proportion of pairs $(x_k,x_\ell)$ that belong to the same cluster in one of the clusterings and to different clusters in the other clustering. Note that, somewhat artificially, this empirical version of the Hamming distance is taking into account data pairs of type $(x_k,x_k)$ as well.
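In statistical terms, if $X_1,\dots,X_n$ are independent random variables with distribution $P$ and we denote by $I_A$ the indicator function of a set $A$, the squared probability $P(C_i)^2$ appearing in (2) is estimated in (3) by the observed value of the $V$-statistic $n^{-2}\sum_{k=1}^{n}\sum_{\ell=1}^{n}I_{C_i}(X_k)I_{C_i}(X_\ell)$, which equals $n_{i+}^2/n^2$. However, $U$-statistics theory (Lee, 1990) shows that a better estimate of $P(C_i)^2$ is $\binom n2^{-1}\sum_{k<\ell}I_{C_i}(X_k)I_{C_i}(X_\ell)$, whose observed value is $\binom{n_{i+}}2/\binom n2$. Reasoning similarly for the other terms in (2) and making these changes everywhere in (3) yields the definition of the Rand distance

$${\rm RD}(\mathcal C,\mathcal D)=\binom n2^{-1}\Big\{\sum_{i=1}^{r}\binom{n_{i+}}2+\sum_{j=1}^{s}\binom{n_{+j}}2-2\sum_{i=1}^{r}\sum_{j=1}^{s}\binom{n_{ij}}2\Big\},$$

which equals the proportion of unordered data pairs that belong to the same cluster in one of the clusterings and to different clusters in the other clustering (see Filkov and Skiena, 2004). This distance was called symmetrical difference distance in Denœud and Guénoche (2006), and it is also considered in Azizyan et al. (2015), under the denomination pairwise clustering loss. In any case, it is not hard to check that ${\rm RD}(\mathcal C,\mathcal D)=\frac{n}{n-1}\,\widehat d_H(\mathcal C,\mathcal D)$, so in fact there is little difference between these two empirical versions of $d_H$.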
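The relation between the two empirical versions of the Hamming distance can be checked numerically. The following is a minimal sketch, assuming the confusion matrix is stored as a list of rows; the function names `empirical_hamming` and `rand_distance` are illustrative choices, not taken from the paper, and the example uses the iris confusion matrix of Table 1.

```python
from fractions import Fraction
from math import comb

def empirical_hamming(table):
    """Proportion of ordered pairs (including pairs of a point with itself) on which the two clusterings disagree."""
    n = sum(map(sum, table))
    rows = sum(sum(r) ** 2 for r in table)
    cols = sum(sum(c) ** 2 for c in zip(*table))
    cells = sum(x * x for r in table for x in r)
    return Fraction(rows + cols - 2 * cells, n * n)

def rand_distance(table):
    """Proportion of unordered pairs of distinct points on which the two clusterings disagree."""
    n = sum(map(sum, table))
    rows = sum(comb(sum(r), 2) for r in table)
    cols = sum(comb(sum(c), 2) for c in zip(*table))
    cells = sum(comb(x, 2) for r in table for x in r)
    return Fraction(rows + cols - 2 * cells, comb(n, 2))

iris = [[50, 0, 0], [0, 48, 2], [0, 1, 49]]  # confusion matrix of Table 1
n = sum(map(sum, iris))
print(empirical_hamming(iris), rand_distance(iris))
# Exact rational arithmetic confirms the factor n / (n - 1) between the two.
print(rand_distance(iris) == Fraction(n, n - 1) * empirical_hamming(iris))
```

Exact fractions are used here only so that the identity can be tested with `==` rather than up to rounding error.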
Instead of measuring the dissimilarity between clusterings using a distance, clustering comparisons can be based on indices that quantify the agreement between them, with values close to 1 indicating greater similarity. In this sense, the Rand index (Rand, 1971) is defined as ${\rm RI}=1-{\rm RD}$. An important feature of the RI is that it also has a clear interpretation as the proportion of unordered data pairs that either belong to the same cluster or to different clusters in both clusterings. However, Fowlkes and Mallows (1983) noted that, when comparing two clusterings with the same number of clusters, the range of possible values of the RI is quite narrow and its expected value quickly approaches 1 as that number of clusters increases. This expectation is meant with respect to a random choice of the entries of the confusion matrix, while keeping its margins fixed, intended to reproduce a null scenario corresponding to independent clusterings. To amend this problem, Hubert and Arabie (1985) proposed to correct the RI for chance, so that it yields an expected value of zero in such a null scenario, and introduced the adjusted Rand index ${\rm ARI}=({\rm RI}-\mathbb E[{\rm RI}])/(1-\mathbb E[{\rm RI}])$. Milligan and Cooper (1986) showed that, in addition, this correction also results in a much wider range of possible values for the ARI over the RI.
The most notable loss along this correction is the interpretation; for example, it is not easy to discern what a given ARI value means, or whether a larger ARI for one pair of clusterings denotes a higher agreement between them than a smaller ARI for a different pair of clusterings, since the baseline could be different. Precisely, in a series of papers (later collected in a single volume), Goodman and Kruskal (1979) emphasized the importance for association measures in cross classifications to have a clear operational interpretation. Besides, Wallace (1983) raised some doubts with respect to the choice of the null scenario in the computation of $\mathbb E[{\rm RI}]$ and, more recently, Gates and Ahn (2017) showed that the use of different null models for index adjustment can lead to disparate conclusions.
Despite these drawbacks, the ARI is one of the most popular and employed indicators for clustering comparison, in close competition with the MED. Hence, one of the main contributions of this paper is to provide a detailed inspection of both of them, via simple examples to help understand their behaviour and their differences. Additional references concerning deep investigation of these criteria include Warrens (2008), Steinley, Brusco and Hubert (2016) and Steinley and Brusco (2018), in the case of the ARI, and Charon et al. (2006), Charon, Denœud and Hudry (2007) and Denœud (2008) regarding the MED.
Here, since the MED is a distance and the ARI is an index, to facilitate their comparison the ARI will first be transformed into a distance, which will be called the adjusted Rand distance and is defined as

$${\rm ARD}=1-{\rm ARI}=\frac{{\rm RD}}{\mathbb E[{\rm RD}]},$$
that is, as the Rand distance normalized by its expected value under the null model. This way, the ARD has unit expected value under the null model.
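The equivalence between the complement of the ARI and the normalized Rand distance follows in one line. The display below is a short derivation sketch, assuming the usual chance-corrected form ${\rm ARI}=({\rm RI}-\mathbb E[{\rm RI}])/(1-\mathbb E[{\rm RI}])$ and ${\rm RI}=1-{\rm RD}$, with the expectation taken under the fixed-margins null model:

$$1-{\rm ARI}=\frac{\big(1-\mathbb E[{\rm RI}]\big)-\big({\rm RI}-\mathbb E[{\rm RI}]\big)}{1-\mathbb E[{\rm RI}]}=\frac{1-{\rm RI}}{1-\mathbb E[{\rm RI}]}=\frac{{\rm RD}}{\mathbb E[{\rm RD}]}.$$

In particular, ${\rm ARD}=0$ corresponds to ${\rm ARI}=1$ (identical partitions), and ${\rm ARD}>1$ corresponds to a negative ARI.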
3 Detailed comparison of the MED and the ARD
In the following, several aspects of the MED and the ARD will be compared in detail. First, explicit computation of the two criteria is addressed. Then, the differences between the two are illustrated through several specific examples. Next, an exhaustive study of the simplest case of a $2\times2$ confusion matrix is provided, with emphasis on exploring the most dissimilar situation between two clusterings. Finally, some of the lessons learned from the $2\times2$ case are generalized for two clusterings of arbitrary size.
3.1 Computation
One undeniable advantage of the ARD over the MED is its simpler definition, which readily translates into a much simpler computation.
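Indeed, let us write $x_k\sim_{\mathcal C}x_\ell$ if the data points $x_k$ and $x_\ell$ belong to the same cluster in $\mathcal C$ (and $x_k\not\sim_{\mathcal C}x_\ell$ otherwise), and consider the cardinalities of the sets of (unordered) data pairs that cover all the possibilities of belonging either to the same or to different clusters in $\mathcal C$ and $\mathcal D$, denoted

$$\begin{aligned}
N_{11}&=\big|\{(k,\ell):k<\ell,\ x_k\sim_{\mathcal C}x_\ell,\ x_k\sim_{\mathcal D}x_\ell\}\big|, &
N_{10}&=\big|\{(k,\ell):k<\ell,\ x_k\sim_{\mathcal C}x_\ell,\ x_k\not\sim_{\mathcal D}x_\ell\}\big|,\\
N_{01}&=\big|\{(k,\ell):k<\ell,\ x_k\not\sim_{\mathcal C}x_\ell,\ x_k\sim_{\mathcal D}x_\ell\}\big|, &
N_{00}&=\big|\{(k,\ell):k<\ell,\ x_k\not\sim_{\mathcal C}x_\ell,\ x_k\not\sim_{\mathcal D}x_\ell\}\big|.
\end{aligned}$$

Then, it is clear that ${\rm RD}=(N_{10}+N_{01})/\binom n2$. Moreover, Steinley (2004) provided a very simple formula for the ARI in terms of these four counts, which entails that $\mathbb E[{\rm RD}]=\binom n2^{-2}\big\{(N_{11}+N_{10})(N_{10}+N_{00})+(N_{11}+N_{01})(N_{01}+N_{00})\big\}$, so that

$${\rm ARD}=\frac{\binom n2\,(N_{10}+N_{01})}{(N_{11}+N_{10})(N_{10}+N_{00})+(N_{11}+N_{01})(N_{01}+N_{00})}.\qquad(5)$$

This is very easy to implement, taking into account that $N_{11}$, $N_{10}$, $N_{01}$ and $N_{00}$ can be immediately computed from the confusion matrix (Jain and Dubes, 1988, Section 4.4.1); for instance, $N_{11}=\sum_{i,j}\binom{n_{ij}}2$, $N_{11}+N_{10}=\sum_i\binom{n_{i+}}2$ and $N_{11}+N_{01}=\sum_j\binom{n_{+j}}2$.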
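The computation described above can be sketched in a few lines. This is an illustrative implementation, not the author's code: it assumes the confusion matrix is given as a list of rows, the function names are hypothetical, and the ARD is obtained as the Rand distance divided by its expected value under the fixed-margins null model, written in terms of the four pair counts.

```python
from math import comb

def pair_counts(table):
    """Return (N11, N10, N01, N00): pairs in the same/different clusters in each partition."""
    n = sum(map(sum, table))
    same_both = sum(comb(x, 2) for row in table for x in row)
    same_rows = sum(comb(sum(row), 2) for row in table)       # same cluster in the row partition
    same_cols = sum(comb(sum(col), 2) for col in zip(*table)) # same cluster in the column partition
    n11 = same_both
    n10 = same_rows - same_both
    n01 = same_cols - same_both
    n00 = comb(n, 2) - n11 - n10 - n01
    return n11, n10, n01, n00

def rand_distance(table):
    n11, n10, n01, n00 = pair_counts(table)
    return (n10 + n01) / (n11 + n10 + n01 + n00)

def adjusted_rand_distance(table):
    n11, n10, n01, n00 = pair_counts(table)
    total = n11 + n10 + n01 + n00
    # total**2 times the expected Rand distance under the null model
    denom = (n11 + n10) * (n10 + n00) + (n11 + n01) * (n01 + n00)
    return total * (n10 + n01) / denom

iris = [[50, 0, 0], [0, 48, 2], [0, 1, 49]]  # confusion matrix of Table 1
print(pair_counts(iris))
print(rand_distance(iris), adjusted_rand_distance(iris))
```

The denominator vanishes only in degenerate cases (for example, when both partitions consist of a single cluster), which this sketch does not handle.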
In contrast, computation of the MED requires solving a discrete optimization problem over $s!$ possible inputs, so its implementation is not that simple, which surely hinders its usage. To fully describe the problem, assume that $r\le s$ and define $n_{ij}=0$ for all $i=r+1,\dots,s$ (if any). Then, computation of the MED involves finding $\max_{\sigma\in\mathcal P_s}\sum_{i=1}^{s}n_{i\sigma(i)}$, where $\mathcal P_s$ denotes the set of all possible permutations of $s$ elements. Despite its apparent complexity, this is a form of the well-known assignment problem, and very efficient algorithms exist to find its solution (see Burkard, Dell'Amico and Martello, 2009). Appendix A below offers a simple implementation using the popular R language (R Core Team, 2019).
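For small confusion matrices the maximization can simply be carried out by exhaustive search. The sketch below is a brute-force illustration in Python (it is not the R implementation of Appendix A, and the function name is hypothetical); it enumerates all one-to-one assignments of rows to columns, which is only feasible for a handful of clusters. For larger tables a dedicated assignment-problem solver, such as the Hungarian algorithm, would be needed.

```python
from itertools import permutations

def misclassification_error_distance(table):
    """MED of a confusion matrix (list of rows) by exhaustive search over matchings."""
    rows = [list(r) for r in table]
    if len(rows) > len(rows[0]):
        # Transpose so that the number of rows does not exceed the number of columns.
        rows = [list(c) for c in zip(*rows)]
    r, s = len(rows), len(rows[0])
    n = sum(map(sum, rows))
    # Each one-to-one map of rows into columns corresponds to a permutation
    # of the table padded with zero rows.
    best = max(
        sum(rows[i][sigma[i]] for i in range(r))
        for sigma in permutations(range(s), r)
    )
    return (n - best) / n

iris = [[50, 0, 0], [0, 48, 2], [0, 1, 49]]  # confusion matrix of Table 1
print(misclassification_error_distance(iris))  # 0.02
```

Relabeling the clusters (permuting rows or columns of the table) leaves the result unchanged, as required of a distance between clusterings.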
3.2 Examples
To help understanding what the ARD and the MED represent and how they are computed in practice it is useful to start with some simple real data examples.
The first example regards the famous iris data set (Anderson, 1935), including 4 measurements on 150 flowers of three species of iris: Iris setosa, versicolor and virginica. If these data are clustered, e.g., using a normal mixture model (Fraley and Raftery, 2002) with three components, it results in the confusion matrix given in Table 1.
Thus, ${\rm MED}=3/150=0.02$, since only 3 data points would need to be re-labeled for the two partitions to coincide. On the other hand, there are 291 data pairs that belong to the same cluster in one of the partitions and to different clusters in the other, and that accounts for a proportion of ${\rm RD}=291/11175\approx0.026$ of the total number of possible data pairs. Finally, using Equation (5) the adjusted Rand distance for those two partitions is ${\rm ARD}\approx0.059$.
This is a very simple example because the two partitions have the same (small) number of clusters, and besides, they are quite similar. Nevertheless, it is helpful to perceive the differences between the MED, the RD and the ARD. Here, perhaps the MED is the easiest criterion to compute and to interpret, since it only involves counting misplaced individual data points. Obtaining the RD from the confusion matrix (by eye) is a bit more complex, since it involves counting data pairs. And the corrected version ARD lacks the interpretability of the former two, but it still yields a very small number, indicating that the two partitions have a high degree of agreement.
Our second example concerns the DLBCL data set, introduced in Aghaeepour et al. (2013). It contains the records of the CD3, CD5 and CD19 antibodies on a set of cells of a patient with Diffuse Large B-cell Lymphoma (DLBCL), along with the true cluster labels in five groups manually found by an expert. In Chacón (2019), this data set was analyzed using several component merging techniques for mixture model clustering, in particular through the so-called modclust and entmerge methods. The former suggested the existence of three clusters, while the latter correctly identified five clusters; both confusion matrices are given together in Table 2.
Regarding the confusion matrix for the modclust labels, again it is not hard to compute the MED: the group matching leading to a higher degree of agreement would be the second true group (second row of Table 2) with cluster 2, the third with cluster 3 and the fifth with cluster 1, whereas the remaining 746 data points would need to be re-labeled to make the two partitions coincide, thus yielding ${\rm MED}=746/8183\approx0.091$. Similarly, for the entmerge labels it can be checked that ${\rm MED}=1040/8183\approx0.127$, so that the modclust clustering is closer to the true expert labels regarding the MED, despite showing a smaller number of clusters. The reason is that, although the entmerge method returned the true number of clusters, its assignments to clusters 4 and 5 were so unfortunate (especially, the splitting of the second true group into two significant groups in clusters 3 and 4) that a high number of re-labelings is needed to make this partition equal to the true one. In contrast, the ARD for these two confusion matrices can be computed to be approximately $0.112$ and $0.097$ for the modclust and entmerge partitions, respectively. As noted before, this does not yield such an intelligible comparison regarding the relative closeness of the two data-based partitions to the true clustering, because the baseline is different for the two contingency tables. Nevertheless, it must be noted that the unadjusted distances ${\rm RD}\approx0.055$ and ${\rm RD}\approx0.047$ (respectively) also suggest that, in terms of data-pair disagreements, the entmerge clustering seems to be slightly closer to the expert partition than the modclust one.
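The opposite rankings given by the MED and by the pair-counting distances in this example can be reproduced directly from Table 2. The following is a minimal sketch under the assumption that rows are the true groups and columns are the data-based clusters, in the order shown in the table; the helper names are illustrative, and the MED is found by brute force, which is affordable for tables of this size.

```python
from itertools import permutations
from math import comb

def med(table):
    rows = [list(r) for r in table]
    if len(rows) > len(rows[0]):          # ensure rows <= columns
        rows = [list(c) for c in zip(*rows)]
    n = sum(map(sum, rows))
    r, s = len(rows), len(rows[0])
    best = max(sum(rows[i][sig[i]] for i in range(r))
               for sig in permutations(range(s), r))
    return (n - best) / n

def rd_and_ard(table):
    total = comb(sum(map(sum, table)), 2)
    same_rows = sum(comb(sum(r), 2) for r in table)
    same_cols = sum(comb(sum(c), 2) for c in zip(*table))
    same_both = sum(comb(x, 2) for r in table for x in r)
    discordant = same_rows + same_cols - 2 * same_both
    # Expected number of discordant pairs under the fixed-margins null model.
    expected = same_rows + same_cols - 2 * same_rows * same_cols / total
    return discordant / total, discordant / expected

modclust = [[47, 197, 7], [0, 1408, 153], [0, 278, 1216], [0, 62, 0], [4813, 2, 0]]
entmerge = [[16, 7, 0, 14, 214], [0, 146, 929, 417, 69], [0, 1191, 81, 63, 159],
            [0, 0, 0, 0, 62], [4809, 0, 0, 1, 5]]

for name, table in (("modclust", modclust), ("entmerge", entmerge)):
    rd, ard = rd_and_ard(table)
    print(name, round(med(table), 3), round(rd, 3), round(ard, 3))
```

With these tables the MED is smaller for modclust, whereas both the RD and the ARD are smaller for entmerge.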
The two previous examples illustrate the common scenario in real data analysis, where data-based partitions are not too dissimilar from the true clustering. To finish this section, a synthetic example concerning quite distant partitions is examined. The confusion matrix shown in Table 3 corresponds to the two assignments of 13 objects into 5 clusters in Table 2 in Steinley (2003).
To appreciate how distant these two clusterings are, it is worth noting that ${\rm ARD}\approx1.16$, greater than 1, meaning that the disagreement between the two is higher than the average that would be obtained if the labels were randomly assigned (following the null model). The number of data pairs that are in the same group in one clustering and in different groups in the other can be computed to be 22, out of the total of 78 possible data pairs, which leads to ${\rm RD}=22/78\approx0.282$. And, by considering any permutation of the columns of the confusion matrix that preserves all its diagonal entries as 1, adding up the off-diagonal figures leads to ${\rm MED}=8/13\approx0.615$.
This example further illustrates how counting “discordant” data pairs seems to be less intuitive than counting “discordant” individual data points. But also, it shows that the permutation for which the MED is attained may not be unique: for instance, rearranging the columns of the confusion matrix according to a different permutation (such as exchanging its first and third columns, which keeps all the diagonal entries equal to 1) yields the same MED value, as already noted in Steinley (2003, 2004). In any case, it is easy to check that the values of the RD and the ARD also remain the same under that permutation. Such a phenomenon is expected to occur for the comparison of very dissimilar clusterings; for example, in the extreme case where the confusion matrix has all its entries equal to 1 (representing independent label assignments), then any permutation of its columns leads to the same MED, RD and ARD values.
3.3 Two clusters in each clustering
In order to gain a deeper understanding of the behaviour of the MED and the ARD the next step is to analyze in detail the simplest scenarios. Arguably, the simplest comparison between two clusterings arises when either $r=1$ or $s=1$, but that could be considered a degenerate case, since in fact one of the partitions would show no clusters. So the next simplest case is $r=s=2$; we will focus our attention on this case first, and then we will generalize some of our findings to the case of arbitrary $r$ and $s$.
Independently of the criterion employed to compare clusterings, any researcher would probably agree that having a diagonal confusion matrix is synonymous with a perfect agreement between the two partitions. But that is also the case if the confusion matrix is anti-diagonal, which means, for $r=s=2$, that

$$N=\begin{pmatrix}0&n_{12}\\ n_{21}&0\end{pmatrix}.$$
This clearly illustrates a key difference between classification and clustering: since classification is a supervised learning problem, the training data are already equipped with precise-meaning labels, and hence an anti-diagonal confusion matrix must be interpreted as the result of a totally wrong classification; in contrast, a clustering algorithm labels the groups as it finds them and, hence, the coding designation is not important (group 1 might as well have been called group 2, and vice versa) so that an anti-diagonal confusion matrix also represents perfect agreement, since the discovered groups are exactly the same, only differing in their (arbitrary) denomination. Mathematically, this means that distances between clusterings must be invariant with respect to permutations of the cluster labels (Meilă, 2012).
It is precisely the way of measuring deviations from the diagonal or anti-diagonal situation that gives rise to the different distances between clusterings. For the case $r=s=2$, let us consider the confusion matrix $N=\begin{pmatrix}n_{11}&n_{12}\\ n_{21}&n_{22}\end{pmatrix}$, and denote by $d=n_{11}+n_{22}$ and $a=n_{12}+n_{21}$ the total sum of its diagonal and anti-diagonal entries, respectively. In this case, it is not hard to show that the MED and the RD can be simply expressed as

$${\rm MED}=\frac{\min\{a,d\}}{n}\qquad\text{and}\qquad{\rm RD}=\frac{a\,d}{\binom n2}.$$

To graphically appreciate the differences between the MED and the RD, and noting that $a+d=n$, Figure 1 shows the possible values of these criteria for a fixed sample size $n$, as a function of $d$. The linear and quadratic appearances of the MED and the RD, respectively, are explained by the fact that they can be equivalently expressed as ${\rm MED}=\min\{d,n-d\}/n$ and ${\rm RD}=d(n-d)/\binom n2$.
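A quick way to convince oneself of these closed forms is to compare them against the general definitions on every $2\times2$ table of a given size. The sketch below does this by exhaustive enumeration; it assumes the notation $d=n_{11}+n_{22}$ (diagonal sum) and $a=n_{12}+n_{21}$ (anti-diagonal sum), and the function names are illustrative.

```python
from fractions import Fraction
from math import comb

def rand_distance(table):
    """General pair-counting definition of the Rand distance."""
    n = sum(map(sum, table))
    rows = sum(comb(sum(r), 2) for r in table)
    cols = sum(comb(sum(c), 2) for c in zip(*table))
    cells = sum(comb(x, 2) for r in table for x in r)
    return Fraction(rows + cols - 2 * cells, comb(n, 2))

def med_2x2(table):
    """General definition of the MED for two clusters: best of the two label matchings."""
    n = sum(map(sum, table))
    diag = table[0][0] + table[1][1]
    anti = table[0][1] + table[1][0]
    return Fraction(n - max(diag, anti), n)

def closed_forms_hold(n):
    """Check MED = min(a, d)/n and RD = a*d/C(n, 2) on all 2x2 tables with total n."""
    for n11 in range(n + 1):
        for n12 in range(n - n11 + 1):
            for n21 in range(n - n11 - n12 + 1):
                n22 = n - n11 - n12 - n21
                table = [[n11, n12], [n21, n22]]
                d, a = n11 + n22, n12 + n21
                if med_2x2(table) != Fraction(min(a, d), n):
                    return False
                if rand_distance(table) != Fraction(a * d, comb(n, 2)):
                    return False
    return True

print(closed_forms_hold(20))
```

The product form of the RD reflects the fact that, with two clusters per partition, a pair of points is discordant exactly when one point lies on the diagonal of the table and the other on the anti-diagonal.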
On the other hand, it is not possible to express the ARD as a function of $a$ and $d$ only. For a given value of $d$, there exist configurations of the confusion matrix that result in different ARD values. Figure 1 also shows all these possible ARD values for each given $d$ (marked with a tick over the whole possible range, which is indicated with a vertical line). This reveals a somewhat erratic behaviour of the ARD in some cases, and inspecting such cases more closely helps to clarify how the ARD works. For instance, for $n=20$ consider the confusion matrices

$$N_1=\begin{pmatrix}16&2\\ 2&0\end{pmatrix}\qquad\text{and}\qquad N_2=\begin{pmatrix}7&4\\ 0&9\end{pmatrix}.$$

The two matrices have $d=16$, so that ${\rm MED}=0.2$ and ${\rm RD}=64/190\approx0.337$ for both $N_1$ and $N_2$. However, ${\rm ARD}\approx1.10$ for $N_1$ whereas ${\rm ARD}\approx0.67$ for $N_2$. In the first configuration, in both clusterings there is a big cluster with 18 elements and a relatively small one with only 2 elements; both clusterings agree on most of the elements in the big cluster, but show no agreement at all regarding the small cluster, since none of the data points has been simultaneously assigned to the small cluster in both clusterings. In the second configuration, the first clustering presents two quite balanced clusters, say $C_1,C_2$, of sizes 11 and 9 (respectively), while the second clustering has clusters $D_1,D_2$ of sizes 7 and 13, which can be obtained from $C_1,C_2$ by transferring 4 elements from $C_1$ to $C_2$. The ARD seems to penalize the first configuration much more severely than the second one.
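The contrast between the two configurations can be made concrete with a short computation. In the sketch below the two $2\times2$ tables are an assumption: they are reconstructed from the verbal description above (cluster sizes 18 and 2 with an empty small-small cell, and cluster sizes 11 and 9 versus 7 and 13 after transferring 4 points), so their orientation may differ from the original ones up to a relabeling, which does not affect any of the distances.

```python
from math import comb

def distances_2x2(table):
    """Return (MED, RD, ARD) for a 2x2 confusion matrix given as a list of rows."""
    n = sum(map(sum, table))
    total = comb(n, 2)
    d = table[0][0] + table[1][1]   # diagonal sum
    a = table[0][1] + table[1][0]   # anti-diagonal sum
    same_rows = sum(comb(sum(r), 2) for r in table)
    same_cols = sum(comb(sum(c), 2) for c in zip(*table))
    discordant = a * d              # discordant pairs when both partitions have two clusters
    # Expected number of discordant pairs under the fixed-margins null model.
    expected = same_rows + same_cols - 2 * same_rows * same_cols / total
    return min(a, d) / n, discordant / total, discordant / expected

n1 = [[16, 2], [2, 0]]  # assumed reconstruction: margins (18, 2) in both clusterings
n2 = [[7, 4], [0, 9]]   # assumed reconstruction: margins (11, 9) and (7, 13)

for table in (n1, n2):
    med, rd, ard = distances_2x2(table)
    print(med, round(rd, 3), round(ard, 3))
```

Both tables share the same MED and RD, so the whole difference in the ARD comes from the expected value in its denominator, which is much smaller for the unbalanced margins of the first table.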
3.3.1 Worst-case scenario
The previous formulas for the MED and the RD in terms of the anti-diagonal sum $a$ and the diagonal sum $d$ are also useful to analyze the worst-case scenario; i.e., the situation in which two given clusterings are as dissimilar as possible. If $n$ is even, then the maximum possible MED is $1/2$ and it is attained for $a=d=n/2$. Thus, it is worth remarking that even for the two most dissimilar possible clusterings the MED is not going to be higher than $1/2$ for the case of $r=s=2$. This could make a case against the use of the MED, since one would expect this distance to attain a maximum of 1 when comparing the most dissimilar clusterings. However, a moment of reflection reveals that this maximum of $1/2$ makes perfect sense in the context of clustering comparison, due to the aforementioned feature that any cluster label permutation should not affect distances between clusterings: having a proportion of label disagreements greater than $1/2$ would mean that by exchanging the label denominations we would get a proportion smaller than $1/2$.
Nevertheless, it is helpful to keep the value of the maximum possible distance in mind at the time of judging how far apart two clusterings are: a given MED value $m$ always has the same interpretation, but in relative terms it represents a worse result if the maximum possible MED is $M_1$ than if it is some larger value $M_2>M_1$. Hence, this suggests the introduction of a normalized MED, defined as ${\rm NMED}={\rm MED}/\max{\rm MED}$, to record how large the MED is with respect to its maximum possible value (given fixed values of $n$, $r$ and $s$). This should not replace the unnormalized MED, since they offer different information, but they should be given together. In the previous example, having ${\rm NMED}=m/M_1$, versus ${\rm NMED}=m/M_2$, indicates that the former situation is closer to the case of totally dissimilar clusterings than the latter one. Notice that this is a very different adjustment from the usual one, since it is not based on the expected value of the index under some null model; indeed, it does not rely on any choice of a null model.
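The difficulty of such a normalization is that it is necessary to analyze which is the worst-case scenario for each index. Continuing with the $2\times2$ table, it is not hard to check that $\max{\rm MED}=(n-1)/(2n)$ if $n$ is odd, which is attained for both $d=(n-1)/2$ and $d=(n+1)/2$. Therefore, ${\rm NMED}=2\,{\rm MED}$ for even $n$ and ${\rm NMED}=\frac{2n}{n-1}\,{\rm MED}$ for odd $n$.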
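Regarding the RD, its maximum is attained at the same value of the diagonal sum $d$ as for the MED, resulting in $\max{\rm RD}=n/\{2(n-1)\}$ for even $n$ and $\max{\rm RD}=(n+1)/(2n)$ for odd $n$, so that it approaches $1/2$ as $n$ increases. Hence, the normalized RD, defined as ${\rm NRD}={\rm RD}/\max{\rm RD}$, can be explicitly written as ${\rm NRD}=4ad/n^2$ for even $n$ and ${\rm NRD}=4ad/(n^2-1)$ for odd $n$, where $a=n-d$ is the anti-diagonal sum.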
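The worst-case values for two clusters in each clustering can be verified by scanning all possible diagonal sums. This is a minimal sketch, assuming the $2\times2$ expressions ${\rm MED}=\min\{d,n-d\}/n$ and ${\rm RD}=d(n-d)/\binom n2$ with $d$ the diagonal sum; the function names are illustrative.

```python
from fractions import Fraction
from math import comb

def max_med(n):
    """Largest MED over all 2x2 tables with n points."""
    return max(Fraction(min(d, n - d), n) for d in range(n + 1))

def max_rd(n):
    """Largest RD over all 2x2 tables with n points."""
    return max(Fraction(d * (n - d), comb(n, 2)) for d in range(n + 1))

def closed_form_maxima(n):
    """Closed-form maxima of (MED, RD), depending on the parity of n."""
    if n % 2 == 0:
        return Fraction(1, 2), Fraction(n, 2 * (n - 1))
    return Fraction(n - 1, 2 * n), Fraction(n + 1, 2 * n)

for n in (10, 11, 20, 21):
    print(n, max_med(n), max_rd(n), closed_form_maxima(n) == (max_med(n), max_rd(n)))
```

Dividing an observed MED or RD by the corresponding maximum gives the normalized versions discussed above.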
For the ARD, it would have been expected that its maximum were attained amongst the possible configurations with diagonal sum $d=n/2$ for even $n$ (or $d=(n\pm1)/2$ for odd $n$), but Figure 1 shows that this does not happen, in general: the maximum ARD can be attained for a configuration with a different value of $d$. It is somewhat counterintuitive that the maximum value of the ARD is not attained for $n_{11}=n_{12}=n_{21}=n_{22}=n/4$, which represents the situation where the labels of the first clustering are perfectly independent from the labels in the second clustering.
In fact, it would be interesting to study which are the possible maximum and minimum values of the ARD for a given . Since and are fixed, the numerator in the definition of the ARD is constant, so this problem is equivalent to finding the minimum and maximum values of for a given . It appears (although it was not possible to find a simple proof) that for a given , the maximum ARD is attained for
[TABLE]
provided is even or odd, respectively, and that for the confusion matrix configuration that maximizes the ARD is . Using the form for even , the resulting maximum ARD for a given can be expressed as
[TABLE]
for . Maximizing with respect to yields , but it is not clear how to obtain an explicit expression for such a maximum. In any case, note that it is not necessary to normalize the ARD, since this distance already includes a kind of normalization (although by the expected value of the RD, not by its maximum).
3.3.2 Close clusterings
Similarly, this in-depth inspection of the case is also beneficial to understand how these measures of dissimilarity between two clusterings evolve when such clusterings are very close. All these distances obviously return a zero value if , but the question that will be addressed here is how these distances behave as and before reaching their null limit. More precisely, the goal is to provide a linear approximation of the MED and the RD for small values of and .
Such an approximation is very easy to find for the MED, since as both it is clear that , so that for small values of and (this is an equality rather than an approximation). On the other hand, it is possible to write ${\rm RD}={n\choose 2}^{-1}\big\{n(n_{12}+n_{21})-(n_{12}+n_{21})^{2}\big\}$, so that a Taylor expansion gives as . This means that, for small values of and , the RD will be roughly twice the MED.
For instance, for we have and , while the approximation formula for the RD reads .
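The factor of two can be read directly from the displayed expression for the RD. As a short check, write $m=n_{12}+n_{21}$ (a shorthand introduced here only for brevity) and assume $m\leq n/2$, so that ${\rm MED}=m/n$:

```latex
\[
{\rm RD}={n\choose 2}^{-1}\big\{nm-m^{2}\big\}
=\frac{2m}{n-1}-\frac{2m^{2}}{n(n-1)},
\qquad
\frac{{\rm RD}}{{\rm MED}}=\frac{2(n-m)}{n-1}.
\]
```

The ratio equals exactly 2 for $m=1$ and decreases slowly as $m$ grows, which is the sense in which the RD is roughly twice the MED for close clusterings.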
3.4 Arbitrary number of clusters
The case is surely the easiest one to analyze in detail, and its analysis results in a deeper understanding of how the MED, the RD and the ARD behave. Here, such an analysis is extended for the comparison of two clusterings with an arbitrary number of clusters.
One of the findings for is that the MED attains its maximum when the clustering labels are perfectly independent. In general, this refers to the situation where is a multiple of and the -confusion matrix has all its entries equal to . In that case, note that for any and, therefore, . But, assuming , Charon et al. (2006, Lemma 1) showed that an upper bound for the MED is (with standing for the ceiling function), which generalizes the bounds obtained for odd or even in the previous section for . Hence, once again the maximum MED is attained for the case of perfectly independent clustering labels. Thus, the corresponding normalized MED is defined in general as , where . The effect of this normalization is noticeable for small values of , but becomes negligible as increases.
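The behaviour of the MED for perfectly independent labels can be checked by brute force on a small example. The Python sketch below computes the MED as one minus the largest proportion of objects that a one-to-one label matching can place on the diagonal, after zero-padding the confusion matrix to a square one. The helper `assumed_max_med` encodes the bound as 1 − ⌈n/s⌉/n; this form is an assumption, since the formula is not legible in this copy of the text. With n = 100 it returns 0.5, 0.66, 0.75 and 0.8 for s = 2, ..., 5, which are the same values as the maxima quoted in Section 4.1. The sizes used in the example are illustrative.

```python
from itertools import permutations
from math import ceil


def med(conf):
    """Brute-force MED of a confusion matrix given as a list of rows."""
    r, s = len(conf), len(conf[0])
    k = max(r, s)
    # Zero-pad to a k x k matrix so that labels can be matched one-to-one.
    sq = [[conf[i][j] if i < r and j < s else 0 for j in range(k)]
          for i in range(k)]
    n = sum(sum(row) for row in conf)
    # Largest number of objects on the diagonal over all label matchings.
    best = max(sum(sq[i][p[i]] for i in range(k))
               for p in permutations(range(k)))
    return (n - best) / n


def assumed_max_med(n, s):
    """ASSUMED form of the upper bound for r <= s: 1 - ceil(n/s)/n."""
    return 1 - ceil(n / s) / n


r, s, c = 2, 3, 4                      # illustrative sizes, with r <= s
indep = [[c] * s for _ in range(r)]    # all entries equal to n/(rs)
n = r * s * c
print(med(indep), 1 - 1 / s, assumed_max_med(n, s))
```

The enumeration over permutations is only practical for a handful of clusters; for larger problems an assignment solver, as in Appendix A, is the sensible choice.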
In contrast, the story for the RD with arbitrary is different from the case . An exhaustive enumeration of all the possible confusion matrix configurations for small values of , and suggests that, given and , the maximum value of the RD is always attained for a matrix of the form
[TABLE]
with and (it remains open to find a formal proof of this fact). This does not mean that the maximizing matrix is necessarily unique; in fact, as noted before, for and the maximum RD is attained for any confusion matrix with , for instance for which has the form (7), and also for , which represents the perfectly independent situation. In addition, assuming that the conjectured form of the maximizer is correct, it is shown in Appendix B that is attained by taking , the next coordinates and the remaining coordinates , where (with standing for the floor function) and ; that is, and are the quotient and the remainder of the (Euclidean) division of by , respectively. With such a choice, it follows that
[TABLE]
This allows one to explicitly define the normalization .
Further, when is a multiple of and the two clusterings are perfectly independent it is easy to check that . Since as , it follows that the maximum is not attained for perfectly independent clusterings for large enough if . Moreover, in practice this seems to be the case for all , as shown in Figure 2. This figure represents the normalized RD for the case of two perfectly independent clusterings for several combinations of and . Only for the pair does the RD for perfectly independent clusterings match its maximum possible value. For any other combination, having two totally unrelated clusterings does not yield the maximum possible RD; indeed, this phenomenon becomes more and more severe as increases and, for instance, for and the confusion matrix with all its entries equal to 4 results in an RD that is only of the maximum achievable RD, attained for a matrix of the form (7) with , and .
Finally, it is worth noting that decreases as and/or increases, so that the RD for the case of perfectly independent clusterings becomes quite small when both and are large and, hence, the RD does not seem useful to detect this important instance of unrelated clusterings. Fowlkes and Mallows (1983, p. 555) already noted this phenomenon, upon inspecting the expected value and variance of the Rand index under the null model.
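The decrease of the RD for independent clusterings is easy to reproduce. The following Python sketch computes the RD from a confusion matrix as the proportion of object pairs that are together in one clustering and apart in the other, and evaluates it on matrices with all entries equal. The value n = 144 and the numbers of clusters are illustrative choices made here so that n is a multiple of the number of cells.

```python
from math import comb


def rand_distance(conf):
    """RD: proportion of object pairs treated differently by the clusterings."""
    n = sum(sum(row) for row in conf)
    # Pairs placed together by the first (rows) and second (columns) clustering.
    a = sum(comb(sum(row), 2) for row in conf)
    b = sum(comb(sum(col), 2) for col in zip(*conf))
    # Pairs placed together by both clusterings.
    both = sum(comb(x, 2) for row in conf for x in row)
    return (a + b - 2 * both) / comb(n, 2)


def independent(r, s, c):
    """Confusion matrix of perfectly independent labels: all entries equal c."""
    return [[c] * s for _ in range(r)]


n = 144
for k in (2, 3, 4, 6):
    conf = independent(k, k, n // (k * k))
    print(k, round(rand_distance(conf), 4))
```

In this example the RD falls steadily as the common number of clusters goes from 2 to 6, even though the two clusterings are unrelated in every case.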
For the ARD, it was not possible to provide an explicit formula for its maximum for , and the problem is of course more intricate for arbitrary and . Nevertheless, it seems clear that the maximum ARD is not attained for the case of independent clustering labels, in general. Instead, the inspection of all possible confusion matrix configurations for small values of , and sufficiently large seems to suggest that the maximum value of the ARD is always attained for a matrix of the form
[TABLE]
with and , (furthermore, with if ). Indeed, confusion matrices with ARD larger than the value corresponding to the perfectly independent case can be constructed by following the guidelines described above for . For instance, if , , , then the confusion matrices
[TABLE]
lead to ARDs of and , respectively, and for , , , the confusion matrices
[TABLE]
yield ARDs of and , respectively. Moreover, when is a multiple of and the two clusterings are perfectly independent, it is easy to show that , which approaches 1 (from above) as increases.
4 Numerical experiments
In this section, the distributions of the MED, the RD, the ARD and the normalized versions NMED and NRD will be compared in different simulated scenarios.
As noted in Van Mechelen et al. (2018), benchmarking studies for cluster analysis do not abound. Nevertheless, the task of comparing different external criteria via simulation was addressed in the seminal paper by Milligan and Cooper (1986) and also more recently in Steinley (2004), Denœud and Guénoche (2006) or Steinley and Brusco (2018).
Broadly speaking, these studies handle two possible scenarios. The first one explores the performance of the criteria in the null case, that is, when the agreement between the compared clusterings is only due to chance. And the second framework concerns how the criteria of interest behave as the two compared clusterings drift apart, starting from perfect similarity. Both scenarios are considered separately in the next sections.
4.1 The null case
As noted above, the null case scenario covers the situation where the clustering agreements are solely due to chance. However, as remarked in Gates and Ahn (2017), different choices for the model for random clusterings can be made, and a careful model selection is needed to provide a baseline that is neither based on a model that “is not random enough” nor on a model that is “too random”.
These authors considered three models for random clusterings, with increasing level of randomness, starting with the permutation model (where the number of clusters and their sizes are fixed), followed by the model where only the number of clusters is fixed, and finally the model encompassing all possible clusterings, with arbitrary number of clusters and cluster sizes. As a compromise for intermediate randomness level, in this section the null case refers to the situation where random labels are drawn uniformly after fixing the number of clusters.
Hence, the distribution of the considered criteria in the null case is explored by computing their values on a sufficiently large number of random clustering pairs of objects, obtained by independently drawing two uniform samples of size from and , respectively. The number of synthetic replicates was set to , in order to obtain a precise approximation of the distributions; the number of clusters was considered equal (), in common with some of the aforementioned previous studies, and ranging in ; and sample sizes and were used to investigate the effect of an increasing number of data points. The distributions of the studied criteria are depicted in Figures 3 and 4 for and , respectively, by means of side-by-side vertical histograms (with the bars mirrored with respect to the vertical axis), whose bars have been rescaled so that each of them has maximum bar length equal to 1, to aid visualization.
Despite being corrected for chance according to the permutation model (which is not exactly the null model in this study), the distribution of the ARD seems to be centered at 1 in all cases, so this type of adjustment makes it possible to compare its behaviour across the different configurations. Its variability is the second lowest among the compared criteria; it does not seem to change with the number of clusters, but it quickly decreases with the sample size, as also noted in Steinley, Brusco and Hubert (2016). This suggests that the ARD may give rise to a powerful tool for detecting clustering independence.
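A reduced version of this null-case experiment can be sketched in a few lines of Python. The sketch draws two independent uniform label vectors with the same number of clusters, builds their confusion matrix and records the MED, the RD and the ARD, the latter computed as the RD divided by its expected value under the permutation model (equivalently, one minus the adjusted Rand index). The sample size, number of clusters, number of replicates and seed below are illustrative values chosen here, not the settings of the study.

```python
import random
from itertools import permutations
from math import comb


def confusion(a, b, r, s):
    """r x s confusion matrix of two label vectors with labels 0..r-1, 0..s-1."""
    table = [[0] * s for _ in range(r)]
    for i, j in zip(a, b):
        table[i][j] += 1
    return table


def med(table):
    """Brute-force MED for a square confusion matrix (r = s)."""
    k = len(table)
    n = sum(sum(row) for row in table)
    best = max(sum(table[i][p[i]] for i in range(k))
               for p in permutations(range(k)))
    return (n - best) / n


def rd_and_ard(table):
    """RD and ARD = RD / E[RD], with E[RD] taken under the permutation model."""
    n = sum(sum(row) for row in table)
    total = comb(n, 2)
    a = sum(comb(sum(row), 2) for row in table)
    b = sum(comb(sum(col), 2) for col in zip(*table))
    both = sum(comb(x, 2) for row in table for x in row)
    rd = (a + b - 2 * both) / total
    # Undefined (zero) only in degenerate cases such as two one-cluster labelings.
    expected_rd = (a + b - 2 * a * b / total) / total
    return rd, rd / expected_rd


def simulate_null(n, r, reps, seed=0):
    """Values of (MED, RD, ARD) for reps pairs of uniform random labelings."""
    rng = random.Random(seed)
    out = []
    for _ in range(reps):
        a = [rng.randrange(r) for _ in range(n)]
        b = [rng.randrange(r) for _ in range(n)]
        t = confusion(a, b, r, r)
        rd, ard = rd_and_ard(t)
        out.append((med(t), rd, ard))
    return out


results = simulate_null(n=100, r=3, reps=200)
mean_ard = sum(x[2] for x in results) / len(results)
print(round(mean_ard, 3))
```

Histograms of the three columns of `results` would give a rough analogue of one panel of the figures, under the stated assumptions.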
The RD is the least variable criterion out of those considered here. This is not surprising in view of the previous Figure 1, since its quadratic nature entails a less pronounced descent around its null-case value than for the MED, for instance. Besides, as remarked in the previous section, under this null scenario the RD only achieves its maximum value for the case , which yields a tightly concentrated distribution of the NRD with a maximum of 1 in that case. However, as previously shown in Figure 2, the maximum possible value of the RD becomes considerably larger than its value for independent clusterings as the number of clusters increases, and this explains why even the distributions of the normalized RD are far from 1. In other words, confusion matrices corresponding to randomly generated clusterings are usually far from something like (7). It might be possible to obtain NRD distributions much closer to 1 if the random clusters were generated to produce confusion matrices only slightly deviated from (7), but that does not seem to be an appropriate null model.
The MED is notably more variable than the ARD and the RD, with standard deviations about 1.6–2.1 times greater than those of the ARD, and 3.5–4.2 times greater than those of the RD, for (3.1–4.2 and 6.7–8.5, respectively, for ). Its variability, though, appears to decrease slightly as the number of clusters grows. Its approximated distribution shows an upper bound that agrees with the results in the previous section (for , e.g., a maximum value of 0.5, 0.66, 0.75 and 0.8 for , respectively), yielding location features that naturally change with the number of clusters, and hence making it inappropriate to aggregate its results across the different simulated configurations. This upper bound also implies that NMED certainly attains a maximum value of 1 for this null scenario of random clusterings. However, it must be pointed out that the probability of attaining such a maximum value seems to decrease with the number of clusters.
Indeed, in some cases it is even possible to give an exact expression for such a probability. For and even , for instance, it corresponds to , where is the sum of the two diagonal terms in the confusion matrix. In the null scenario, is a random variable following a binomial distribution, with as the number of trials and probability of success (the probability that two uniform and independent choices from are the same). Hence, $P(d_{1}=n/2)={n\choose n/2}\big/2^{n}$. More generally, here the random variable follows a folded binomial distribution (Gart, 1970).
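The exact probability in the last display is straightforward to evaluate. The short Python sketch below does so for a few even sample sizes; the success probability 1/2 is the value implied by the $2^{n}$ denominator, and the sample sizes are illustrative.

```python
from math import comb


def prob_max_nmed(n):
    """P(d1 = n/2) when d1 ~ Binomial(n, 1/2), for even n."""
    if n % 2 != 0:
        raise ValueError("n must be even")
    return comb(n, n // 2) / 2 ** n


for n in (10, 100, 1000):
    print(n, prob_max_nmed(n))
```

The probability shrinks as n grows, so for two clusters the maximum NMED value of 1 is reached less and less often in larger samples.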
4.2 Diverging clusterings
The second simulation scenario concerns studying the evolution of the compared criteria as two clusterings move away from each other, starting from a situation of perfect agreement.
Interestingly, in most of the existing simulation studies (see, for instance, Steinley, 2004; Denœud and Guénoche, 2006), the process of “moving away from each other” is quantified by measuring the proportion of data points that are differently clustered from the initial stage of perfect agreement. Steinley (2004) called this proportion the “degree of overlap”, and more recently Steinley and Brusco (2018, Section 3.2.2) referred to this measure of deviation from the perfect agreement as the misclassification rate. So, overall, this simulation scenario concerns inspecting how the other clustering distances evolve as compared to the MED (see Figure 3 in Steinley, 2004, or Figure 1 in Denœud and Guénoche, 2006).
As an aside, it should be noted that what Steinley (2004) and Steinley and Brusco (2018) called degree of overlap is not exactly the same as the MED. The simulation setup in these references concerns a diagonal confusion matrix as the starting point (hence, a situation of perfect agreement) that is progressively perturbed by randomly taking a proportion of objects from the diagonal and placing them in off-diagonal cells. This proportion of off-diagonal objects is what is called degree of overlap (DO), and in the aforementioned studies it is allowed to vary in . However, this is not the same as the misclassification rate: while the DO and the MED usually coincide for low DO values, when the DO is too high it may occur that the resulting clusterings actually become closer with respect to the MED, instead of further away. For example, consider the confusion matrices
[TABLE]
For all of them, . Matrix stems from after removing a total of 13 objects from the diagonal, so that for ; it can be checked that for as well. Five additional objects are removed from the diagonal when going from to , representing a total with respect to , but for , lower than for . Of course, this is due to the fact that for and , so that it does not seem appropriate to consider DO values greater than in this case.
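The gap between the DO and the MED can be reproduced with a small Python sketch. The two perturbed matrices below are hypothetical examples constructed here, not the matrices of the paper: both start from a diagonal matrix with 10 objects per cluster, and differ in how many objects have been moved off the diagonal.

```python
from itertools import permutations


def degree_of_overlap(conf):
    """Proportion of objects lying outside the diagonal."""
    n = sum(sum(row) for row in conf)
    return (n - sum(conf[i][i] for i in range(len(conf)))) / n


def med(conf):
    """Brute-force MED for a square confusion matrix."""
    k = len(conf)
    n = sum(sum(row) for row in conf)
    best = max(sum(conf[i][p[i]] for i in range(k))
               for p in permutations(range(k)))
    return (n - best) / n


# Hypothetical perturbations of diag(10, 10, 10).
mild = [[7, 2, 1],
        [1, 8, 1],
        [2, 1, 7]]
heavy = [[1, 8, 1],
         [1, 1, 8],
         [8, 1, 1]]

for name, conf in (("mild", mild), ("heavy", heavy)):
    print(name, degree_of_overlap(conf), med(conf))
```

For the mildly perturbed matrix the two quantities coincide (8/30), whereas for the heavily perturbed one the DO is 0.9 but relabelling the clusters brings the MED down to 0.2.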
In Denœud and Guénoche (2006), several agreement indices were compared as a function of an increasing MED. A given starting partition is recursively perturbed by randomly selecting one element and a new class label for it. This procedure aims at randomly and equiprobably generating partitions at a precise MED from the given one. However, there the number of clusters is not fixed and, hence, their study comprises a higher degree of uncertainty.
Nevertheless, for the goal of inspecting the evolution of the different distances as a function of the MED, the most exhaustive procedure is surely that based on computing the involved measures for all the possible confusion matrix configurations, that is, for all the matrices in for all , for all , and . Indeed, this is precisely what Figure 1 represents for and . But this could be accomplished in that case because the cardinality was reasonably small.
For and the class of all possible confusion matrices is considerably larger, namely , but still not prohibitive, so its exhaustive enumeration is yet feasible. Therefore, the ARD, the MED and the RD were obtained for each of these possible confusion matrix configurations, yielding a large amount of interesting information. Figure 5 shows boxplots for the conditional distributions of the RD (left) and the ARD (right), given the MED, along with the (mean) regression curve. These plots contain the same information as Figure 1, but a first notable difference is that now the RD corresponding to a given MED is no longer a single value, as it happened for ; instead, for all the possible confusion matrices with the same MED result in a wide range of different RD values.
Most of the conditional distributions given the MED are fairly symmetric, but it is worth remarking some interesting features that arise, especially, for very low or very high MED values. For instance, Figure 6 focuses on the distribution of the RD given the particular values of (left) and (right), which clearly show a high degree of skewness. The conditional distribution of the ARD also shows some peculiarities: its maximum value () is attained for a confusion matrix with , but its conditional mean attains its maximum at (the maximum possible MED value). The outlier in the conditional distribution of the ARD given is particularly striking, with a value of , attained at the confusion matrix
[TABLE]
despite being only 4 data point transfers away from the perfect agreement situation.
This extensive enumeration study is also useful to inspect the individual distributions of each criterion. For example, Figure 7 shows the distribution of all the MED values for and , and reveals a very different scenario from the one that Figure 1 in Steinley (2004) suggested. In Steinley’s simulation study, the distribution of the MED appeared to be somewhat uniform, which strongly contrasts with the distribution shape shown in Figure 7 from exhaustive enumeration. The reason is that in Steinley’s paper the distribution of the MED is investigated by aggregation of all the multiple simulation conditions. And, as noted before, these simulation conditions involved uniformly varying the DO from to . It was already noted above that the DO is not exactly the same as the MED, but they are closely related, so forcing a fixed given number of simulations for every DO level naturally results in a (nearly) uniform distribution for the MED. However, the exhaustive inspection of all the possible confusion matrix configurations in Figure 7 shows that the MED distribution is quite different from the uniform one.
For higher values of , or , it is not possible to enumerate all the confusion matrices in anymore, since its cardinality becomes exorbitant. An alternative way to approximate the criteria distributions, for these higher values of , and , would be to randomly sample a large number of matrices from (in an equiprobable way), and then compute their MEDs, RDs and ARDs in order to obtain an equivalent approximation of Figure 5. This suggestion is not without problems, either, for two reasons: first, it is not straightforward to uniformly sample from , see Appendix C for a valid procedure; and second, the fact that some of the possible MED values occur only for a low number of confusion matrices makes it difficult to procure an accurate approximation of the conditional distributions given such MED values. For example, from the exhaustive enumeration of it follows that the probability of obtaining a confusion matrix with by uniform sampling is approximately , so a large simulation size would be required in order to approximate the distribution of the other distances given .
In any case, following the procedure suggested in Appendix C, a random sample of size was drawn from , and the values of the MED, RD and ARD for these confusion matrices were recorded. It must be remarked that the cardinality of is approximately , which makes the exhaustive enumeration approach unfeasible. From that sample, it is possible to approximate the conditional means of the RD and ARD given the MED (Figure 8, left) and to provide an approximate analogue of Figure 7 for and (Figure 8, right). Notice also that, even if the possible MED values are (because for and ), the range of MED values for which 10 or more observations were obtained in this particular sample reduced to (i.e., the others had sample frequencies smaller than ), and that is why Figure 8 presents some missing parts.
5 Discussion
Regarding the task of comparing two partitions of a finite data set, surely the confusion matrix is the object that yields the most complete information. However, when it has a considerable number of cells, it provides somewhat too many details and it becomes necessary to resort to some summary statistic to extract useful information. The MED, the RD and the ARD are examples of such summary statistics, each of them offering a different synopsis.
The first two represent empirical versions of distances between whole-space clusterings and, intuitively, correspond to computing the proportion of “differently placed” individual data points (in case of the MED) or data pairs (for the RD) along the compared clusterings. Considering data pairs instead of individual data points appears somewhat less intuitive. In fact, as pointed out in Hubert and Arabie (1985, Section 4), one could equally consider using data triplets or, more generally, data -tuples. Comparisons, however, become more and more intricate as increases. It is in this sense that the choice of (i.e., the MED) represents the simplest option.
But it is not just a matter of simplicity. As long as , the RD also shows a more serious and undesirable drawback: the case of completely unrelated clusterings does not correspond to the most dissimilar clustering pair, according to the RD, and this phenomenon becomes more and more severe as the clustering sizes increase (as shown in Figure 2). This unfortunate feature is not shared by the MED, which does point out unrelated clusterings as an instance of extreme dissimilarity, and furthermore shows a maximum value that quickly approaches 1 as increases.
A possibility to correct the aforementioned flaw is to consider the relative size of the RD with respect to the average RD value when the two clusterings are generated at random; this is what the ARD provides. This exhibits the natural advantage of creating a criterion that is always centred at 1 for the null case, but on the other hand introduces a distorting element that further complicates the interpretation: now the ARD represents the relative size of the proportion of differently treated data pairs in the compared clusterings with respect to a baseline, taken as the average value of that proportion when the cluster labels are assigned at random while maintaining the number of clusters and cluster sizes fixed. This also implies that, if the baseline changes (as usually happens when inspecting two different confusion matrices), then the relative comparison of the two scenarios by means of ARD scores becomes unclear.
An alternative remedy, also aimed at examining the relative size of a criterion, but this time against the worst possible case, is to normalize such a criterion with respect to its maximum value. This is a different kind of adjustment, which does not produce a criterion that is centred at 1 for a null model (in fact, it does not rely on a specific null model); however, it ensures that all the resulting values lie on instead. When applied to the MED and the RD it results in the new NMED and NRD criteria, which are not advised to be used alone, but jointly with their unnormalized counterparts, since the latter retain the most straightforward interpretation. In addition, not achieving its maximum for unrelated clusterings also hinders this approach for the RD, as shown in Figures 3 and 4, since it entails that the distribution of the NRD can be far from 1 under the null model. In contrast, the NMED distribution is indeed close to its upper bound of 1 in the null case, more so for higher sample sizes, although it must be pointed out that it seems more and more unlikely to reach this upper bound as the number of clusters increases.
In any case, it seems clear that the study of the distributions of all these criteria (the MED, the RD and the ARD) deserves further consideration, since its investigation through exhaustive enumeration or uniform sampling from the set of all possible confusion matrices has revealed some previous misconceptions and unexpected features.
Acknowledgments. The author acknowledges the support of the Spanish Ministerio de Economía y Competitividad grant MTM2016-78751-P and the Junta de Extremadura grant GR18016.
Appendix A: R function for misclassification error distance computation
Recall from Section 3.1 that, given a confusion matrix with , the main computational problem is to find
[TABLE]
where denotes the set of all possible permutations of elements. Here, for all (if any) and for . Fortunately, (9) is a linear sum assignment problem, whose solution can be efficiently found through the function solve_LSAP included in the R library clue (Hornik, 2005, 2018). So once that library is loaded, with the command library(clue), a function to compute the MED from two equal-size vectors containing the cluster labels with respect to each clustering can be obtained through the following simple code:
    library(clue)  # provides solve_LSAP

    med <- function(labels1, labels2) {
      n <- length(labels1)
      N <- table(labels1, labels2)  # confusion matrix
      # Transpose if needed so that the number of rows does not exceed the columns
      if (nrow(N) > ncol(N)) N <- t(N)
      r <- nrow(N)
      s <- ncol(N)
      # Pad with zero rows to obtain a square s x s matrix
      if (r < s) N <- rbind(N, matrix(0, nrow = s - r, ncol = s))
      # Cost of matching row label i with column label j
      M <- matrix(rowSums(N), nrow = s, ncol = s) +
        matrix(colSums(N), nrow = s, ncol = s, byrow = TRUE) - 2 * N
      optimal.permutation <- solve_LSAP(M)
      result <- sum(M[cbind(seq_along(optimal.permutation), optimal.permutation)]) / (2 * n)
      return(result)
    }
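As a cross-check of the logic of this function, the same cost-matrix formulation can be written in standard-library Python, with the assignment solver replaced by a brute-force search over permutations (only feasible for a few clusters). This is a sketch for verification purposes, not a substitute for the R function; the function name is illustrative, and the input is the iris confusion matrix reported earlier in the paper.

```python
from itertools import permutations


def med_via_costs(conf):
    """MED from a confusion matrix (list of rows), by brute-force matching."""
    r, s = len(conf), len(conf[0])
    k = max(r, s)
    # Zero-pad to a square matrix (the R code transposes first, so that
    # only rows need to be added; the result is the same).
    N = [[conf[i][j] if i < r and j < s else 0 for j in range(k)]
         for i in range(k)]
    n = sum(sum(row) for row in N)
    row = [sum(N[i]) for i in range(k)]
    col = [sum(N[i][j] for i in range(k)) for j in range(k)]
    # Cost of matching row label i with column label j.
    M = [[row[i] + col[j] - 2 * N[i][j] for j in range(k)] for i in range(k)]
    best = min(sum(M[i][p[i]] for i in range(k))
               for p in permutations(range(k)))
    return best / (2 * n)


iris = [[50, 0, 0],
        [0, 48, 2],
        [0, 1, 49]]
print(med_via_costs(iris))
```

For the iris table the minimum total cost is 6, which corresponds to 3 misplaced flowers out of 150.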
Appendix B: The maximum Rand distance
Assuming as true the conjecture that there is always a maximizer of the RD of the form (7), here it is shown that the maximum value of the RD satisfies (8). First notice that, for a confusion matrix of the form (7), the function can be explicitly written as
[TABLE]
where . Then, the goal is to maximize with the constraint that the total number of data points is , which for the matrix (7) yields . The method of Lagrange multipliers yields the maximizer over real-valued choices of as , , but recall that the goal is to find the maximizer for nonnegative integer values of .
As in Section 3.4, write , with and . If (corresponding to the case ), then the real-valued maximizer is also integer-valued and leads to the maximizer and maximum value announced in (7) and (8) for .
If then the real-valued maximizer is rational, with all the coordinates having the same fractional part . Due to the total sum constraint, to find the integer-valued maximizer of (10) these fractional remainders need to be re-distributed into the coordinates , to make them integer, while at the same time trying to decrease the value of as little as possible with respect to the real-valued maximizer. To achieve this, first note that (10) is a concave function with the same curvature along every direction, so the least decrease with integer coordinates with respect to the real-valued maximum corresponds to rounding up to the least greater integer as few coordinates of as possible. Having fractional remainders of size , that entails that the integer-valued maximizer is found by rounding up exactly coordinates to the least greater integer (and rounding down the remaining coordinates). Finally, since play a symmetric role in (10), the only two cases to study are either to round up and of the remaining coordinates, or to round up coordinates among (say, the first of them). The first of these cases yields , , , while the second one entails , , . It is easy to check that these two choices achieve the same value for , and the second one agrees with the form posited in (7) and (8), which is, thus, valid for arbitrary .
Appendix C: Uniform sampling from
A composition of a positive integer into parts is a representation in which are non-negative integers and the order of the summands matters. There are possible compositions of into parts, and it is easy to draw a composition at random (uniformly) without necessarily generating the set of all of them (see Nijenhuis and Wilf, 1978, Chapters 5 and 6).
The entries of any confusion matrix constitute a composition of into parts. The additional conditions and for all , imposed in the definition of to ensure that the sizes of the associated compared clusterings match and , respectively, can be checked after arranging each drawn composition of into parts by columns (say) into an matrix, so uniform sampling from is guaranteed by rejection sampling.
For the simulations in Section 4.2 a sample of size was drawn from using the previous approach. The recorded rejection rate during the process was very low, approximately , so the sampling algorithm is very efficient. Moreover, that rejection rate can also be interpreted as an estimate of the proportion of compositions of into parts that cannot be converted (by columns) into a confusion matrix of and, since , that yields an estimate of for the cardinality of .
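The rejection scheme described in this appendix can be sketched as follows in Python. Two points are assumptions of the sketch rather than statements taken from the paper: the uniform composition is drawn with the standard stars-and-bars construction (a uniformly chosen set of bar positions), and the acceptance conditions are taken to be that every row sum and every column sum of the arranged matrix is positive. The sizes in the usage lines are illustrative.

```python
import random


def random_composition(n, k, rng):
    """Uniform composition of n into k non-negative parts (stars and bars)."""
    # Choose k - 1 bar positions among n + k - 1 slots, uniformly at random.
    bars = sorted(rng.sample(range(n + k - 1), k - 1))
    parts, prev = [], -1
    for b in bars + [n + k - 1]:
        parts.append(b - prev - 1)   # stars between consecutive bars
        prev = b
    return parts


def sample_confusion_matrix(n, r, s, rng):
    """Rejection sampling of an r x s matrix with no empty row or column."""
    tries = 0
    while True:
        tries += 1
        parts = random_composition(n, r * s, rng)
        # Arrange the composition by columns into an r x s matrix.
        m = [[parts[j * r + i] for j in range(s)] for i in range(r)]
        rows_ok = all(sum(row) > 0 for row in m)
        cols_ok = all(sum(col) > 0 for col in zip(*m))
        if rows_ok and cols_ok:      # ASSUMED acceptance conditions
            return m, tries


rng = random.Random(1)
matrix, tries = sample_confusion_matrix(n=50, r=3, s=4, rng=rng)
for row in matrix:
    print(row)
print("draws needed:", tries)
```

Since every composition is equally likely and rejected draws are simply discarded, the accepted matrices are uniform over the accepted set; the average number of draws per accepted matrix gives an empirical view of the rejection rate.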
References
- Aghaeepour, N., Finak, G., The FlowCAP Consortium, The DREAM Consortium, Hoos, H., Mosmann, T.R., Brinkman, R., Gottardo, R. and Scheuermann, R.H. (2013). Critical assessment of automated flow cytometry analysis techniques. Nature Methods, 10, 228–238.
- Anderson, E. (1935). The irises of the Gaspe Peninsula. Bulletin of the American Iris Society, 59, 2–5.
- Azizyan, M., Chen, Y.-C., Singh, A. and Wasserman, L. (2015). Risk bounds for mode clustering. arXiv:1505.00482.
- Ben-David, S., von Luxburg, U. and Pál, D. (2006). A sober look at clustering stability. In G. Lugosi and H.-U. Simon, editors, Proceedings of the 19th Annual Conference on Learning Theory (COLT), pp. 5–19. Springer, Berlin.
- Burkard, R., Dell’Amico, M. and Martello, S. (2009). Assignment Problems. SIAM, Philadelphia.
- Chacón, J.E. (2015). A population background for nonparametric density-based clustering. Statistical Science, 30, 518–532.
- Chacón, J.E. (2019). Mixture model modal clustering. Advances in Data Analysis and Classification, 13, 379–404.
- Charon, I., Denœud, L., Guénoche, A. and Hudry, O. (2006). Maximum transfer distance between partitions. Journal of Classification, 23, 103–121.
