Accuracy Evaluation of Overlapping and Multi-resolution Clustering Algorithms on Large Datasets

Artem Lutov; Mourad Khayati; Philippe Cudré-Mauroux

arXiv:1902.01691·cs.DS·February 18, 2019


TL;DR

This paper evaluates the accuracy of overlapping and multi-resolution clustering algorithms on large datasets, proposing new metrics and optimizations to improve efficiency and effectiveness, with open-source implementations available.

Contribution

It introduces a new indexing technique for faster accuracy metric computation and extends existing metrics to better satisfy formal constraints on large datasets.

Findings

01 New indexing reduces runtime and memory usage

02 Metrics are faster than state-of-the-art on large datasets

03 Open-source C++ implementations available


Tables5

Table 1. TABLE I: Accuracy evaluation of Low and High vs Ground-truth clusterings by Omega Index and Soft Omega Index.

Metrics \ Clusterings	Ground-truth	Low	High
Clusters	C1’: 1 2 3; C2’: 2 3 4; C3’: 3 4 1; C4’: 4 1 2	C1: 1 2; C2: 3 4	C1: 1 2; C2: 2 3; C3: 3 4; C4: 4 1
Omega Index		0	0
Soft Omega Index		0	0.33

Table 2. TABLE II: Formal Constraints for Soft and original Omega Index.

Clusterings	Homogen.		Complet.		RagBag		SzQual.
Metrics	low	high	low	high	low	high	low	high
[Soft] Omega Index	0.247	0.282	0.244	0.311	0.4	0.4	0.804	0.804

Table 3. TABLE III: Formal Constraints for MF1 metric family.

Clusterings	Homogen.		Complet.		RagBag		SzQual.
Metrics	low	high	low	high	low	high	low	high
F1a	0.646	0.646	0.639	0.663	0.641	0.630	0.795	0.936
F1h	0.646	0.646	0.639	0.660	0.639	0.630	0.795	0.935
F1p	0.665	0.672	0.686	0.703	0.693	0.693	0.819	0.942

Table 4. TABLE IV: Formal Constraints for NMI, original GNMI and our GNMI.

Clusterings	Homogen.		Complet.		RagBag		SzQual.
Metrics	low	high	low	high	low	high	low	high
NMI	0.450	0.555	0.546	0.546	0.434	0.434	0.781	0.888
${GNMI}_{o r i g}$	0.512	0.598	0.572	0.632	0.417	0.397	0.808	0.877
GNMI	0.448	0.557	0.546	0.547	0.434	0.436	0.781	0.888

Table 5. TABLE V: Formal Constraints for Omega Index, MF1 and GNMI.

Metrics\Clusterings	Homogen.	Complet.	SzQual.
[Soft] Omega Index	+	+
F1h		+	+
F1p	+	+	+
GNMI	+		+

Equations14

$$Omega(C^{\prime},C)=\frac{Obs(C^{\prime},C)-Exp(C^{\prime},C)}{1-Exp(C^{\prime},C)}.$$

$$Obs(C^{\prime},C)=\sum_{j=0}^{\min(J^{\prime},J)}A_{j}/P,$$

$$Exp(C^{\prime},C)=\sum_{j=0}^{\min(J^{\prime},J)}P^{\prime}_{j}P_{j}/P^{2},$$

$$Obs_{soft}(C^{\prime},C)=\sum_{j=0}^{\max(J^{\prime},J)}Anorm_{j}/P,$$

$$Exp_{soft}(C^{\prime},C)=\Big(\sum_{j=0}^{Jmin}P^{\prime}_{j}P_{j}+\sum_{j=Jmin+1}^{\max(J^{\prime},J)}Prem_{j}\Big)/P^{2},$$

$$F1a(C^{\prime},C)=\frac{1}{2}(F_{C^{\prime},C}+F_{C,C^{\prime}}),$$

$$F_{X,Y}=\frac{1}{|X|}\sum_{x_{i}\in X}F1(x_{i},g(x_{i},Y)),\quad g(x,Y)=\{\operatorname{argmax}_{y}F1(x,y)\mid y\in Y\},$$

$$F1h(C^{\prime},C)=\frac{2F_{C^{\prime},C}F_{C,C^{\prime}}}{F_{C^{\prime},C}+F_{C,C^{\prime}}}.$$

$$pprob(m,c^{\prime},c)=\frac{m}{|c^{\prime}|}\cdot\frac{m}{|c|}=\frac{m^{2}}{|c^{\prime}|\cdot|c|},$$

$$f1(m,c^{\prime},c)=2\frac{m/|c^{\prime}|\cdot m/|c|}{m/|c^{\prime}|+m/|c|}=2\frac{m}{|c^{\prime}|+|c|},$$

$$I(C^{\prime}:C)=\sum_{c^{\prime}\in C^{\prime}}\sum_{c\in C}p(c^{\prime},c)\log_{2}\frac{p(c^{\prime},c)}{p(c^{\prime})p(c)},$$

$$NMI(C^{\prime},C)=\frac{I(C^{\prime}:C)}{\max(H(C^{\prime}),H(C))}.$$

$$H(X)=-\sum_{x\in X}p(x)\log_{2}p(x).$$

$$evsmax=\max\Big(\min(mbs^{\prime},mbs),\frac{1}{rerr\cdot rrisk}\Big),$$


Full text

Accuracy Evaluation of Overlapping and Multi-resolution Clustering Algorithms on Large Datasets

This project has received funding from the European Research Council (ERC) under the European Union’s Horizon 2020 research and innovation program (grant agreement 683253/GraphInt) and in part by the Swiss National Science Foundation under grant number CRSII2 147609.

Artem Lutov, Mourad Khayati and Philippe Cudré-Mauroux

eXascale Infolab, University of Fribourg—Switzerland

Email: {firstname.lastname}@unifr.ch

Abstract

Performance of clustering algorithms is evaluated with the help of accuracy metrics. There is a great diversity of clustering algorithms, which are key components of many data analysis and exploration systems. However, only a few metrics exist for the accuracy measurement of overlapping and multi-resolution clustering algorithms on large datasets. In this paper, we first discuss existing metrics, how they satisfy a set of formal constraints, and how they can be applied to specific cases. Then, we propose several optimizations and extensions of these metrics. More specifically, we introduce a new indexing technique to reduce both the runtime and the memory complexity of the Mean F1 score evaluation. Our technique can be applied on large datasets and it is faster on a single CPU than state-of-the-art implementations running on high-performance servers. In addition, we propose several extensions of the discussed metrics to improve their effectiveness and satisfaction of formal constraints without affecting their efficiency. All the metrics discussed in this paper are implemented in C++ and are available for free as open-source packages that can be used either as stand-alone tools or as part of a benchmarking system to compare various clustering algorithms.

Index Terms:

accuracy metrics, overlapping community evaluation, multi-resolution clustering evaluation, Generalized NMI, Omega Index, MF1, similarity of collections of sets

I Introduction

Clustering is a key component of many data mining systems with numerous applications including statistical analysis and the exploration of physical, social, biological and informational systems. This diversity of potential applications spawned a wide variety of network (graph) clustering algorithms proposed in the literature. It also led to specialized clustering algorithms in particular domains. Hence, the need to find the most suitable and best performing clustering algorithms for a given task has become more pressing, and the evaluation of the resulting clusterings through proper metrics more important. Moreover, as modern systems often operate on very large datasets (potentially consisting of billions of items), the computational properties of the evaluation metrics become more important. In particular, performance-related constraints rapidly emerge when sampling the original large datasets is not desirable or not possible for a given use case.

Clustering quality metrics can formally be categorized into two types: intrinsic and extrinsic metrics. Intrinsic quality metrics evaluate how similar the elements of each cluster are to each other and how they differ from elements in other clusters given a similarity metric such as modularity [1] or conductance [2]. Extrinsic quality metrics (also known as accuracy metrics) instead evaluate how similar the clusters are to the ground-truth (gold standard) clusters. In this paper, we focus on extrinsic metrics since they make it possible to identify clustering algorithms producing expected results, which is in practice often more useful than measuring the formation of optimal clusters by a given similarity metric. More specifically, we evaluate the similarity (proximity) between two collections of overlapping clusters (unordered sets) of elements, where

a) the collections have the same number of elements,

b) each element may be a member of multiple clusters and

c) each cluster may have several, non-mutual, best (in terms of the similarity value) matches in the other collection.

We call a clustering the set of clusters resulting from an algorithm. Clusterings can be categorized as non-overlapping (crisp clustering, hard partitioning), overlapping (soft, fuzzy clustering) or, in some cases, multi-resolution (including hierarchical). Multi-resolution clusterings are considered when there is a need to simultaneously compare multiple resolutions (hierarchy levels) of the results against the ground-truth, where each resolution contains non/overlapping clusters as discussed further in Section IV-D. Non-overlapping clusterings can be seen as a special case of overlapping clusterings.

A large number of accuracy metrics were proposed in the literature to measure the clustering quality [3, 4, 5, 6, 7, 8]. Evaluation frameworks [9, 10, 11, 12, 13] and surveys [14, 15, 16, 17] were also introduced. Despite the large number of accuracy metrics proposed, very few metrics are applicable to overlapping clusters, causing many issues when evaluating such clusters (e.g., Adjusted Rand Index is used to evaluate overlapping clusters in [18], even though it is applicable to non-overlapping clusters only). Moreover, most of the quality metrics for overlapping clusters are not comparable to similar metrics for non-overlapping clusters (e.g., standard NMI [5] or modularity [1] versus some overlapping NMI [19] or overlapping modularity [20, 21, 22, 23] implementations), which complicates the direct comparison of the respective clustering algorithms.

Therefore, further research is required to develop accuracy metrics that are applicable to overlapping (and multi-resolution) clusterings, that satisfy tight performance constraints and, ideally, that are compatible with the results of standard accuracy metrics used for non-overlapping clustering. Finally, producing a single, easy-to-interpret value for the final clustering is also important, in order to help the user pick the most suitable clustering for a particular use case and for potentially several accuracy metrics. This issue has been tackled through the formal constraints, introduced for example in [3, 6], and is further discussed in Section III.

To the best of our knowledge, this is the first work discussing all state-of-the-art accuracy metrics applicable to overlapping clustering evaluation on large datasets (i.e., with more than $10^{7}$ elements). Being able to evaluate metrics on large datasets means, in our context, that the evaluation process should be at most:

a) quadratic in terms of runtime complexity and

b) quasilinear in terms of memory complexity

with the number of elements considered. In addition, we also introduce in this paper a novel indexing technique to reduce the runtime complexity of the Mean F1 score (and a similar metric, NVD [9]) evaluation from $O(N\cdot(|C|+|C^{\prime}|))$ to $O(N)$ , where $N$ is the number of elements in the processed clusters, $|C|$ is the number of resulting clusters and $|C^{\prime}|$ is the number of ground-truth clusters. Finally, we propose extensions to the state-of-the-art accuracy metrics to satisfy more formal constraints (i.e., to improve their effectiveness) without sacrificing their efficiency. Efficient C++ implementations of

a) all discussed accuracy metrics and

b) their improved versions

are freely available online as open source utilities as listed in the sections devoted to each metric. All our accuracy metrics are also integrated into an open source benchmarking framework for clustering algorithms evaluation, Clubmark (https://github.com/eXascaleInfolab/clubmark) [13], besides being available as dedicated applications.

II Related Work

Related efforts can be categorized into three main groups, namely:

a) accuracy metrics for overlapping clustering evaluation (satisfying the complexity constraints mentioned above),

b) frameworks providing efficient implementations for accuracy evaluation and

c) formal constraints for accuracy metrics (discussed in Section III).

Accuracy Metrics for Overlapping Clusters

The Omega Index [4] is the first accuracy metric that was proposed for overlapping clustering evaluation. It belongs to the family of Pair Counting Based Metrics. It is a fuzzy version of the Adjusted Rand Index (ARI) [24] and is identical to the Fuzzy Rand Index [25]. We describe the Omega Index in Section IV-A.

Versions of Normalized Mutual Information (NMI) suitable for overlapping clustering evaluation were introduced as Overlapping NMI (ONMI, https://github.com/eXascaleInfolab/OvpNMI) [19] and Generalized NMI (GNMI) [26] and belong to the family of Information Theory Based Metrics. The authors of ONMI suggested extending Mutual Information with approximations (introduced in [27]) to find the best matches for each cluster of a pair of overlapping clusterings. This approach makes it possible to compare overlapping clusters but, unlike the GNMI version we introduce in Section IV-C, it yields values that are incompatible with standard NMI [5] results.

The Average F1 score is introduced in [7, 28] and a similar metric, NVD, is introduced in [9]. The Average F1 score belongs to the family of Cluster Matching Based Metrics and is described in Section IV-B.

Accuracy Measurement Frameworks

A toolkit (https://github.com/chenmingming/ParallelComMetric) for the parallel measurement of the quality of non-overlapping clusterings on both distributed and shared memory machines is introduced in [9]. This toolkit performs the evaluation of several accuracy metrics (Average F1 score, NMI, ARI and JI) as well as some intrinsic quality metrics, and provides highly optimized parallel implementations of these metrics in C++ leveraging MPI (the Message Passing Interface) and Pthreads (POSIX Threads). Among its accuracy metrics, only Average F1 score is applicable to overlapping clusterings.

WebOCD [10] is an open-source RESTful web framework for the development, evaluation and analysis of overlapping community detection (clustering) algorithms. It comprises several baseline algorithms, evaluation metrics and preprocessing utilities. However, since WebOCD (including all its accuracy metrics) is implemented in pure Java as a monolithic framework, many existing implementations of evaluation metrics cannot be easily integrated into WebOCD without either being reimplemented in Java or modifying the framework architecture. A reimplementation of existing metrics is not always possible without a significant performance drop (especially when linking native, high-performance libraries such as Intel TBB, STL or Boost) and time investment.

CoDAR [11] is a framework for community detection algorithm evaluation and recommendation providing user-friendly interfaces and visualizations. Based on this framework, the authors also introduced a study of non-overlapping community detection algorithms on unweighted undirected networks [14]. Unfortunately, the framework URL provided in the paper refers to a forbidden page, i.e., the implementation is not available to the public anymore.

Circulo [12] is a framework for community detection algorithms evaluation. It executes the algorithms on previously uploaded input networks and then evaluates the results using several accuracy metrics and multiple intrinsic metrics.

III Formal Constraints on Clustering Evaluation Metrics

Four formal constraints for the accuracy metrics were introduced in [3] and shed light on which aspects of the quality of a clustering are captured by different metrics. Two of these constraints (Homogeneity, see Fig. 4 and Completeness, see Fig. 4) were originally proposed in [6] while the other two (Rag Bag, Fig. 4 and Size vs Quantity, Fig. 4) were newly introduced. Besides being intuitive, these four constraints were developed to be formally provable and to clarify the limitations of each metric. We list these constraints in the way they were presented in [3] and later use them to discuss various accuracy metrics. Each constraint is written as an inequality applied to a pair of clusterings, where a quality metric Q (some accuracy metric in our case) from the clustering on the right-hand side is assumed to be better than the one from the left-hand side. Ground-truth clusters are called categories for short in the following.

IV Accuracy Metrics for Overlapping Clustering

Accuracy metrics indicate how much one clustering (i.e., set of clusters) is similar to another (ground-truth) clustering. For each presented metric, we first give its original definition before proposing our extensions and optimizations. Then, we empirically evaluate the aforementioned four formal constraints on samples from [3] and given in Fig. 8-8 denoting the left clustering as Low and the right clustering as High for each sample. The constraints that are satisfied are marked in italics in the results tables for each discussed metric in the respective subsection. Cluster elements belonging to the same category (ground-truth cluster) in these figures are colored with the same color and texture, while the formed clusters (results) are shown with oval shapes.

IV-A Omega Index

IV-A1 Preliminaries

The Omega Index [4] is an ARI [24] generalization applicable to overlapping clusters. It is based on counting the number of pairs of elements occurring in exactly the same number of clusters as in the number of categories and adjusted to the expected number of such pairs. Formally, given the ground-truth clustering $C^{\prime}$ consisting of categories $c^{\prime}_{i}\in C^{\prime}$ and formed clusters $c_{i}\in C$ :

$$Omega(C^{\prime},C)=\frac{Obs(C^{\prime},C)-Exp(C^{\prime},C)}{1-Exp(C^{\prime},C)}. \tag{1}$$

The observed agreement is:

$$Obs(C^{\prime},C)=\sum_{j=0}^{\min(J^{\prime},J)}A_{j}/P, \tag{2}$$

where $J^{\prime}$ ( $J$ ) is the maximal number of categories (clusters) in which a pair of elements occurred, $A_{j}$ is the number of pairs of elements occurring in exactly $j$ categories and exactly $j$ clusters, and $P=N\cdot(N-1)/2$ is the total number of pairs given a total of $N$ elements (nodes of the network being clustered).

The expected agreement is:

$$Exp(C^{\prime},C)=\sum_{j=0}^{\min(J^{\prime},J)}P^{\prime}_{j}P_{j}/P^{2}, \tag{3}$$

where $P^{\prime}_{j}$ ( $P_{j}$ ) is the total number of pairs of elements assigned to exactly $j$ categories (clusters).
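The definitions above can be illustrated with a minimal Python sketch that computes the original Omega Index by brute force over all pairs of elements. It is not the xmeasures implementation: the function names (`pair_counts`, `omega_index`) and the representation of a clustering as a list of sets are assumptions made for this example. The input data reproduces the clusterings listed in Table I.

```python
from collections import Counter
from itertools import combinations


def pair_counts(clusters):
    """Map each unordered pair of elements to the number of clusters containing both."""
    counts = Counter()
    for cluster in clusters:
        for pair in combinations(sorted(cluster), 2):
            counts[pair] += 1
    return counts


def omega_index(categories, clusters):
    """Original Omega Index: (Obs - Exp) / (1 - Exp), brute force over all pairs.

    Assumes at least two elements and Exp < 1 (otherwise the ratio is undefined).
    """
    nodes = sorted(set().union(*categories, *clusters))
    total_pairs = len(nodes) * (len(nodes) - 1) // 2    # P
    cat_counts = pair_counts(categories)
    cls_counts = pair_counts(clusters)

    agree = 0              # sum of A_j: pairs with equal counts on both sides
    cat_hist = Counter()   # P'_j: pairs occurring in exactly j categories
    cls_hist = Counter()   # P_j: pairs occurring in exactly j clusters
    for pair in combinations(nodes, 2):
        j_cat, j_cls = cat_counts[pair], cls_counts[pair]
        cat_hist[j_cat] += 1
        cls_hist[j_cls] += 1
        if j_cat == j_cls:
            agree += 1

    j_min = min(max(cat_hist), max(cls_hist))           # min(J', J)
    obs = agree / total_pairs
    exp = sum(cat_hist[j] * cls_hist[j] for j in range(j_min + 1)) / total_pairs ** 2
    return (obs - exp) / (1 - exp)


# Clusterings from Table I.
ground_truth = [{1, 2, 3}, {2, 3, 4}, {3, 4, 1}, {4, 1, 2}]
low = [{1, 2}, {3, 4}]
high = [{1, 2}, {2, 3}, {3, 4}, {4, 1}]
print(omega_index(ground_truth, low), omega_index(ground_truth, high))
```

Both calls yield 0: every pair occurs in exactly two categories but in at most one cluster, so no pair contributes to the observed agreement, which matches the behaviour discussed for Table I.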

IV-A2 Proposed Extension (Soft Omega Index)

The Omega Index evaluates overlapping clusterings by counting the number of pairs of elements present in exactly the same number of clusters as in the number of categories, which does not take into account pairs present in a slightly different number of clusters. We propose to fix this issue by normalizing the smaller number of occurrences of each pair of elements in all clusters of one clustering by the larger number of occurrences in the other clustering, as outlined on line 9 of Algorithm 1. The input data consists of two clusterings ( $grs,cls$ ), and the $rels$ hashmap relating the clusters to their elements (nodes) for each clustering. The updated computation of the observed agreement of pairs also requires correcting the expected agreement, which is performed on line 19.

Thus, the Soft Omega Index has the same definition as Eq. 1, except that the observed agreement is evaluated as:

$$Obs_{soft}(C^{\prime},C)=\sum_{j=0}^{\max(J^{\prime},J)}Anorm_{j}/P, \tag{4}$$

where $Anorm_{j}$ is the number of pairs of elements occurring in exactly $j^{\prime}$ and $j$ clusters of the clusterings and being weighted by $\min(j^{\prime},j)/\max(j^{\prime},j)$ .

The expected agreement is:

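$$Exp_{soft}(C^{\prime},C)=\Big(\sum_{j=0}^{Jmin}P^{\prime}_{j}P_{j}+\sum_{j=Jmin+1}^{\max(J^{\prime},J)}Prem_{j}\Big)/P^{2}, \tag{5}$$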

where $Jmin=\min(J^{\prime},J)$ , $Prem_{j}=P_{j}$ if $\min(J^{\prime},J)=J^{\prime}$ and $Prem_{j}=P^{\prime}_{j}$ otherwise.

Note that for non-overlapping clusterings (i.e., when the membership in clusters equals 1 $\implies J^{\prime}=J=1$ ), the Soft Omega Index is equivalent to the original Omega Index, which is equivalent to ARI.

IV-A3 Evaluation and Constraints Matching

A counterexample outlining the issue of the Omega Index when discarding partially matching pairs of elements is shown in Table I, where each of the four categories (C1’-C4’) consists of 3 elements and the total number of elements is 4 (#1-#4). The first clustering algorithm has a Low accuracy and discovers only two clusters as shown in Table I. The second clustering algorithm (High) performs much better discovering all four clusters but not all elements of the respective categories as shown in Table I. The original Omega Index fails to discriminate these cases yielding 0 for both cases, whereas the Soft Omega Index clearly differentiates the more accurate solution.

The empirical satisfaction of the formal constraints for both versions of the Omega Index is given in Table II and discussed below in Section IV-D. The computational complexity is $O(N^{2})$ and the memory complexity is $O(|C|+|C^{\prime}|)\approx O(\sqrt{N})$ for both implementations, where $N$ is the number of elements in the clustering (on average, the number of clusters in most real-world networks consisting of $N$ nodes is $\sqrt{N}$ ). Implementations of both the original and Soft Omega Index are provided in the open source xmeasures utility (https://github.com/eXascaleInfolab/xmeasures) and are available for free.

IV-B Mean F1 Score

IV-B1 Preliminaries

The Average F1 score (F1a) is a commonly used metric to measure the accuracy of clustering algorithms [7, 28, 29]. F1a is defined as the average of the weighted F1 scores [30] of

a) the best matching ground-truth clusters to the formed clusters and

b) the best matching formed clusters to the ground-truth clusters.

Formally, given the ground-truth clustering $C^{\prime}$ consisting of clusters $c^{\prime}_{i}\in C^{\prime}$ (called categories) and clusters $c_{i}\in C$ formed by the evaluated clustering algorithm:

$$F1a(C^{\prime},C)=\frac{1}{2}(F_{C^{\prime},C}+F_{C,C^{\prime}}), \tag{6}$$

where

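$$F_{X,Y}=\frac{1}{|X|}\sum_{x_{i}\in X}F1(x_{i},g(x_{i},Y)),\quad g(x,Y)=\{\operatorname{argmax}_{y}F1(x,y)\mid y\in Y\}, \tag{7}$$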

where $F1(x,y)$ is the F1 score of the respective clusters.
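A small Python sketch may help to make the best-match semantics of F1a concrete. It evaluates the definition by brute force, comparing every cluster with every category, and treats each element as contributing 1 (the non-overlapping convention). The names `f1`, `mean_best_f1` and `f1a` and the list-of-sets representation are assumptions made for this illustration, not the xmeasures API.

```python
def f1(x, y):
    """F1 score of two non-empty clusters given as sets.

    Equals the harmonic mean of |x & y| / |x| and |x & y| / |y|.
    """
    return 2 * len(x & y) / (len(x) + len(y))


def mean_best_f1(xs, ys):
    """F_{X,Y}: average over x in X of the best F1 score against any y in Y."""
    return sum(max(f1(x, y) for y in ys) for x in xs) / len(xs)


def f1a(categories, clusters):
    """Average F1 score: arithmetic mean of the two directional scores."""
    return 0.5 * (mean_best_f1(categories, clusters)
                  + mean_best_f1(clusters, categories))


categories = [{1, 2, 3}, {4, 5, 6}]
clusters = [{1, 2}, {3, 4, 5, 6}]
print(f1a(categories, clusters))
```

In this example the best matches are mutual ({1, 2, 3} with {1, 2} at 0.8 and {4, 5, 6} with {3, 4, 5, 6} at 6/7), so both directional scores equal 29/35. The brute-force loop costs one F1 evaluation per (cluster, category) pair, which is what the indexing technique of Section IV-B3 avoids.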

IV-B2 Proposed Extensions (F1h and F1p)

The F1a definition yields non-indicative values of $F1a\in[0,0.5]$ when evaluating a large number of clusters. In particular, for clusters formed by taking all possible combinations of the nodes, $F1a>0.5$ : $F_{C^{\prime},C}=1$ since for each category there exists an exactly matching cluster, while $F_{C,C^{\prime}}\rightarrow 0$ since the majority of clusters have low similarity to the categories. To address this issue, we suggest using the harmonic mean instead of the arithmetic mean (average). We introduce the harmonic F1 score (F1h) as:

$$F1h(C^{\prime},C)=\frac{2F_{C^{\prime},C}F_{C,C^{\prime}}}{F_{C^{\prime},C}+F_{C,C^{\prime}}}. \tag{8}$$

$F1h\leq F1a$ since the harmonic mean cannot be larger than the arithmetic mean. In our case, $F_{C^{\prime},C}$ can be interpreted as a recall and $F_{C,C^{\prime}}$ as a precision of the evaluated clustering $C$ with respect to the ground-truth clustering $C^{\prime}$ .
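The difference between the two means can be seen on a small numeric example. The Python sketch below takes the two directional scores as inputs; the function names and the score values (1.0 and 0.1) are hypothetical and chosen only to mimic the situation in which every category is matched exactly while most formed clusters match poorly.

```python
def f1a_from_scores(f_cat_cls, f_cls_cat):
    """Arithmetic mean of the two directional scores (F1a)."""
    return 0.5 * (f_cat_cls + f_cls_cat)


def f1h_from_scores(f_cat_cls, f_cls_cat):
    """Harmonic mean of the two directional scores (F1h); 0 if both are 0."""
    total = f_cat_cls + f_cls_cat
    return 2 * f_cat_cls * f_cls_cat / total if total > 0 else 0.0


# Hypothetical directional scores: recall-like 1.0, precision-like 0.1.
recall_like, precision_like = 1.0, 0.1
print(f1a_from_scores(recall_like, precision_like))
print(f1h_from_scores(recall_like, precision_like))
```

With these inputs the arithmetic mean stays above 0.5 (about 0.55) although one direction is almost uninformative, whereas the harmonic mean drops to about 0.18.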

$F1h$ is more indicative than $F1a$ but neither measure satisfies the Homogeneity constraint, as they penalize local best matches too severely. To address this issue, we propose to evaluate the probability of the local best matches rather than the F1 score. Our new metric $F1p$ is the harmonic mean (i.e., the F1 measure) of the average over each clustering of the best local probabilities for each cluster. $F1p$ corresponds to the expected probability of the best match of the clusterings, unlike $F1h$ , which corresponds to the expected worst case of the best match in the clusterings. Formally, $F1p$ is evaluated similarly to $F1h$ , except that the local matching function $pprob$ given in Eq. 9 replaces $f1$ given in Eq. 10.

$$pprob(m,c^{\prime},c)=\frac{m}{|c^{\prime}|}\cdot\frac{m}{|c|}=\frac{m^{2}}{|c^{\prime}|\cdot|c|}, \tag{9}$$

$$f1(m,c^{\prime},c)=2\frac{m/|c^{\prime}|\cdot m/|c|}{m/|c^{\prime}|+m/|c|}=2\frac{m}{|c^{\prime}|+|c|}, \tag{10}$$

where $m$ is the contribution of matched elements between the cluster $c$ and category $c^{\prime}$ . The notions of contribution and $|x|$ (size of the cluster $x$ ) vary for the overlapping and multi-resolution clusterings and are discussed in Section IV-D. For multi-resolution and non-overlapping clusterings, $|x|$ is simply equal to the number of elements in the cluster $x$ , and the contribution of each element is equal to 1. For overlapping clusterings, $|x|$ is equal to the total contribution of elements in cluster $x$ , where each element $x_{i}$ contributes the value $1$ /shares( $x_{i}$ ) and shares( $x_{i}$ ) is the number of (overlapping) clusters of which $x_{i}$ is a member.
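The two local matching functions and the overlapping notion of cluster size can be sketched in Python as follows. This is an illustration under stated assumptions: the helper names (`pprob`, `f1`, `contributions`, `weighted_size`) are hypothetical, and the sketch deliberately does not compute $m$ for overlapping clusterings, since that convention is only specified in Section IV-D.

```python
from collections import Counter


def pprob(m, size_cat, size_cls):
    """Local matching probability: (m / |c'|) * (m / |c|)."""
    return m * m / (size_cat * size_cls)


def f1(m, size_cat, size_cls):
    """Local F1 score in its simplified form: 2 * m / (|c'| + |c|)."""
    return 2 * m / (size_cat + size_cls)


def contributions(clusters):
    """Contribution 1 / shares(x_i) of every element of an overlapping clustering."""
    shares = Counter(node for cluster in clusters for node in cluster)
    return {node: 1 / count for node, count in shares.items()}


def weighted_size(cluster, contrib):
    """|x| for overlapping clusterings: total contribution of the members of x."""
    return sum(contrib[node] for node in cluster)


# Non-overlapping case: category {1, 2, 3, 4}, cluster {1, 2}, so m = 2.
print(pprob(2, 4, 2), f1(2, 4, 2))

# Overlapping case: element 3 is shared by two clusters and contributes 1/2 to each.
overlapping = [{1, 2, 3}, {3, 4}]
contrib = contributions(overlapping)
print([weighted_size(cluster, contrib) for cluster in overlapping])
```

One consequence of the 1/shares weighting is that the weighted sizes of all clusters sum to the number of distinct elements (here 2.5 + 1.5 = 4).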

IV-B3 Optimizations (Efficient Indexing) for the F1 Metric Family

We propose an efficient indexing technique to reduce the computational complexity of computing the Mean F1 score metric family ( $F1a,F1h$ and $F1p$ ). Our technique is based on dedicated data structures described below. When loading the clusterings, we create a $rels$ hashmap relating the clusters to their elements (nodes) for each clustering. Besides the member nodes, our cluster data structure also holds:

a) an accumulator $ctr$ for the matching contributions of the member nodes together with the pointer to the originating cluster from which these contributions are formed, and

b) the local contributions $cont$ for all member nodes of the cluster.

This data structure allows evaluating the metrics using a single pass over all members of all clusters. The content of the $ctr$ attribute is reset on line 9 in Algorithm 2 when a contribution is added together with a cluster pointer that differs from the one already stored. The main procedure to evaluate the aforementioned best matches for the clustering $cls$ is listed in Algorithm 2 considering

a) the fmatch matching function (f1 or pprob given in Eq. 9-10) parameterized as the $prob$ argument, and

b) the overlapping or multi-scale clusters evaluation semantics parameterized as $ovp$ .
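The idea of replacing the all-pairs comparison by an index lookup can be sketched in Python as below. This is a simplified illustration and not a transcription of Algorithm 2: it uses an inverted node-to-cluster index and a per-cluster dictionary of match counts instead of the $ctr$ accumulator, treats every element as contributing 1, and the function names are assumptions.

```python
from collections import defaultdict


def build_index(clusters):
    """Map every node to the ids of the clusters that contain it."""
    index = defaultdict(list)
    for cid, cluster in enumerate(clusters):
        for node in cluster:
            index[node].append(cid)
    return index


def mean_best_f1_indexed(xs, ys):
    """F_{X,Y}, visiting only those clusters of Y that share a node with x."""
    index = build_index(ys)
    total = 0.0
    for x in xs:
        matched = defaultdict(int)           # id of y -> |x & y|
        for node in x:
            for yid in index.get(node, ()):  # .get avoids growing the index
                matched[yid] += 1
        # Best local F1 among the candidates; 0 if x shares no node with Y.
        total += max((2 * m / (len(x) + len(ys[yid]))
                      for yid, m in matched.items()), default=0.0)
    return total / len(xs)


categories = [{1, 2, 3}, {4, 5, 6}]
clusters = [{1, 2}, {3, 4, 5, 6}]
print(mean_best_f1_indexed(categories, clusters))
```

The work per cluster is proportional to the total membership of its nodes in the other clustering, so the whole pass is proportional to the number of elements times their average membership rather than to the number of (cluster, category) pairs.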

IV-B4 Evaluation and Constraints Matching

Our new indexing technique reduces the computational complexity of Mean F1 Score evaluation from $O(N\cdot(|C|+|C^{\prime}|))$ [9] $\simeq O(N\sqrt{N})$ (for an average of $\sqrt{N}$ clusters) to $O(N\cdot s)\simeq O(N)$ , where $N$ is the number of elements in the processed clusters and the constant $s$ is the average membership of the elements, which typically is $\in[1,2)$ for overlapping clusterings. Implementations of all $MF1$ metrics ( $F1a,F1h,F1p$ ) are provided in the open source xmeasures utility (https://github.com/eXascaleInfolab/xmeasures) and are available for free.

The empirical satisfaction of the formal constraints for all $MF1$ metrics is given in Table III and discussed in Section IV-D.

IV-C Generalized NMI

IV-C1 Preliminaries

Generalized NMI (GNMI, https://github.com/eXascaleInfolab/GenConvMI) [26] uses a stochastic process to compare overlapping clusterings and feeds the random variables of the process into the standard definition of mutual information (MI) [31]. MI is evaluated by taking all pairs of clusters from the formed and ground-truth clusterings respectively and counting the number of common elements in each pair. Formally, given the ground-truth clustering $C^{\prime}$ consisting of clusters $c^{\prime}\in C^{\prime}$ and the formed clusters $c\in C$ , mutual information is defined as:

$$I(C^{\prime}:C)=\sum_{c^{\prime}\in C^{\prime}}\sum_{c\in C}p(c^{\prime},c)\log_{2}\frac{p(c^{\prime},c)}{p(c^{\prime})p(c)}, \tag{11}$$

where $p(c^{\prime},c)$ is the normalized number of common elements in the pair of (category, cluster), and $p(c^{\prime})$ and $p(c)$ are the normalized numbers of elements in the categories and formed clusters respectively. The normalization is performed using the total number of elements in the clustering, i.e., the number of nodes in the input network.

Normalized Mutual Information (NMI) [5] normalizes MI by the maximum value, the arithmetic mean or the geometric mean of the unconditional entropies $H(C^{\prime})$ and $H(C)$ of the clusterings. Normalization by the maximum value of the entropies is the standard approach, which is also considered the most discriminative one [26, 19]:

$$NMI(C^{\prime},C)=\frac{I(C^{\prime}:C)}{\max(H(C^{\prime}),H(C))}. \tag{12}$$

The unconditional entropy $H(X)$ [32] of clusters $x\in X$ is:

$$H(X)=-\sum_{x\in X}p(x)\log_{2}p(x). \tag{13}$$
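For non-overlapping clusterings, the standard NMI with max-normalization described above can be evaluated directly, without any stochastic process. The Python sketch below does so under the assumptions that clusterings are lists of disjoint sets and that `n` is the total number of elements; the function names are hypothetical and the code is unrelated to the GNMI implementations discussed next.

```python
from math import log2


def entropy(clusters, n):
    """Unconditional entropy H(X) with p(x) = |x| / n."""
    return -sum(len(c) / n * log2(len(c) / n) for c in clusters if c)


def mutual_information(categories, clusters, n):
    """I(C' : C) with p(c', c) = |c' & c| / n; empty intersections add nothing."""
    mi = 0.0
    for cat in categories:
        for cls in clusters:
            common = len(cat & cls)
            if common:
                p_joint = common / n
                p_cat, p_cls = len(cat) / n, len(cls) / n
                mi += p_joint * log2(p_joint / (p_cat * p_cls))
    return mi


def nmi_max(categories, clusters, n):
    """NMI normalized by the larger entropy (undefined if both entropies are 0)."""
    return (mutual_information(categories, clusters, n)
            / max(entropy(categories, n), entropy(clusters, n)))


categories = [{1, 2}, {3, 4}]
print(nmi_max(categories, [{1, 2}, {3, 4}], 4))  # identical partitions
print(nmi_max(categories, [{1, 3}, {2, 4}], 4))  # statistically independent partitions
```

The first call gives 1 and the second gives 0, the two extremes of the metric.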

IV-C2 Proposed Extension and Optimizations

The GNMI approach of using a stochastic process provides the only known way to evaluate standard NMI for overlapping clusterings. The original GNMI implementation (https://github.com/eXascaleInfolab/GenConvMI) uses Intel’s TBB library (lightweight threads) to execute the stochastic process on all available CPUs efficiently. However, the original implementation has several shortcomings, which make it inapplicable to large datasets:

• its hard-coded maximal number of stochastic events (successful samples) EVCOUNT_THRESHOLD is not adequate for handling both small and large datasets. Moreover, the original value is too small for large datasets;

• its fully random sampling of the cluster elements has too high a computational complexity on large datasets to produce results within a reasonably small evaluation error (default value is 0.01);

• the two aforementioned points cause significant errors on small datasets and loss of convergence on large datasets while consuming significant computational resources.

We optimized and extended the original version addressing these issues in the following ways. First, instead of the hard-coded EVCOUNT_THRESHOLD, we dynamically evaluate the maximal number of stochastic events as:

$$evsmax=\max\Big(\min(mbs^{\prime},mbs),\frac{1}{rerr\cdot rrisk}\Big), \tag{14}$$

where $mbs^{\prime}$ and $mbs$ are the total membership of the elements in the categories and clusters respectively, $rerr$ is the admissible error and $rrisk$ is the complement of the resulting confidence; $rerr$ and $rrisk$ are specified as input arguments with a default value equal to $0.01$ .
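As a worked illustration of this bound, the following Python sketch evaluates the maximal number of events for a small and a large dataset. It assumes that the second term is the reciprocal of the product of $rerr$ and $rrisk$; the function name `max_events` and the membership totals are hypothetical.

```python
def max_events(mbs_cat, mbs_cls, rerr=0.01, rrisk=0.01):
    """Upper bound on stochastic events: max(min(mbs', mbs), 1 / (rerr * rrisk)).

    Assumption: the denominator is the product of the admissible error
    and the risk (complement of the confidence).
    """
    return max(min(mbs_cat, mbs_cls), 1 / (rerr * rrisk))


# Small dataset: the error/risk term (about 10^4 with the defaults) dominates.
print(max_events(500, 800))

# Large dataset: the smaller total membership dominates.
print(max_events(5_000_000, 7_000_000))
```

Under this reading the bound never falls below roughly $10^{4}$ events for the default arguments and grows with the dataset size afterwards, which addresses the inadequacy of a single hard-coded threshold for both small and large datasets.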

Second, we extended the original try_get_sample procedure with the weighted adaptive sampling given in Algorithm 3. The original version randomly takes the first node (cluster element) among all nodes in the clusterings as shown on line 2 and then applies $mixer$ until it returns false or all nodes are traversed in a randomized order. We traverse the nodes located in the same cluster of the formed ( $c$ ) or ground-truth ( $g$ ) clusterings on line 7 and weight the shared nodes inversely to their membership. The weighting is performed to discount the contribution (importance) of frequent nodes compared to rare nodes since we traverse only a fraction of all nodes and they may have varying membership $\geq 1$ . Indexes on the matched ground-truth and formed clusters are stored in the $mixer$ and their matching probability is returned explicitly as $importance$ .

In addition, we performed some technical optimizations to reduce the number of memory allocations and copies, calls to external functions, etc. Our extended implementation of GNMI (https://github.com/eXascaleInfolab/GenConvNMI) is open source and available for free.

IV-C3 Evaluation and Constraints Matching

The empirical satisfaction of the formal constraints for $NMI$ , the original $GNMI_{orig}$ and our $GNMI$ implementations is given in Table IV and discussed below in Section IV-D.

Since both GNMI implementations yield stochastic results, we report the median value over 5 runs with the same default values of the error and risk arguments. As the table shows, our GNMI implementation is much more accurate than the original one and yields values equal to the original NMI within the specified admissible error (0.01) on non-overlapping clusterings. The empirical evaluation on

a) the synthetic datasets formed using the LFR (https://github.com/eXascaleInfolab/LFR-Benchmark_UndirWeightOvp) [33] framework and

b) the large real-world networks with ground-truth (https://snap.stanford.edu/data/#communities) introduced in [29]

shows that our implementation is one order of magnitude faster on datasets with $10^{4}$ nodes and two orders of magnitude faster on datasets with $10^{6}$ nodes than the original GNMI implementation. The actual computational complexity of both GNMI implementations depends on the structure of the clusters and on the number of overlaps in the clusterings. Moreover, for very dissimilar clusterings the evaluation might not converge at all, which is a disadvantage of the stochastic evaluation. The worst-case computational complexity of our GNMI implementation is $O(N\cdot s\cdot|\bar{c}|\cdot|C|)\approx O(N\cdot|\bar{c}|\cdot|C|)$, while it is $O(\text{EVCOUNT\_THRESHOLD}\cdot N\cdot|C|)$ for the original GNMI, where $N$ is the number of elements in the evaluated clusterings, $|C|$ is the number of clusters and $|\bar{c}|$ is the average size of the clusters.

IV-D Discussion

When evaluating clusterings, it is important to

a) distinguish overlapping from multi-resolution (including hierarchical) clusterings and

b) consider the limitations of the applied accuracy metric

in order to produce meaningful results. First, we discuss the differences when handling overlapping clusterings versus multi-resolution clusterings having non-overlapping clusters on each resolution and how they affect accuracy evaluations.

In the case of overlapping clusterings, a node $x_{i}$ can be shared between $s$ clusters and has equal membership in each of them (e.g., a person may spend time on several distinct hobbies, but the more hobbies are involved, the less time the person devotes to each one). Thus, the membership contribution of $x_{i}$ to each of the clusters is $1/s$. In the case of non-overlapping, multi-resolution clusterings, a node $x_{i}$ may be a member of $s$ nested clusters taken at different resolutions (granularities), but here $x_{i}$ fully belongs to each of these clusters, having a membership contribution equal to $1$ (e.g., a student taking a course can be a full member of the course, as well as of the department offering the course and of the whole university). These distinct semantics for the elements shared in a clustering are represented by the $ovp$ argument in our xmeasures utility for all MF1 metrics.
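The two membership semantics can be sketched as follows. This is only an illustration under stated assumptions: the function `membership_contributions` and its boolean `ovp` parameter are hypothetical and do not reproduce the interface of the xmeasures utility.

```python
from collections import Counter


def membership_contributions(clusters, ovp):
    """Return, per cluster, the contribution of each member node.

    ovp=True  : overlapping semantics, a node in s clusters contributes 1/s.
    ovp=False : multi-resolution semantics, a node contributes 1 everywhere.
    """
    counts = Counter(node for cluster in clusters for node in set(cluster))
    return [
        {node: (1.0 / counts[node] if ovp else 1.0) for node in set(cluster)}
        for cluster in clusters
    ]


# Overlapping: "ann" splits her time between two hobbies.
hobbies = [["ann", "bob"], ["ann", "eve"]]
overlapping = membership_contributions(hobbies, ovp=True)

# Multi-resolution: course, department and university are nested.
nested = [["ann"], ["ann", "bob"], ["ann", "bob", "eve"]]
multires = membership_contributions(nested, ovp=False)
```

With the overlapping semantics "ann" contributes 0.5 to each hobby cluster, whereas with the multi-resolution semantics she contributes 1 to the course, the department and the university alike.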

We summarize the advantages and limitations of the various metrics discussed in this paper as follows:

Omega Index pros: its values are not affected by the number of clusters (unlike NMI) and have an intuitive interpretation (0 means equal quality to a random clustering).

Omega Index cons: it performs poorly when applied to multi-resolution clusterings and has the highest computational complexity among all considered metrics ( $O(N^{2})$ ).

MF1 pros: it has the lowest computational complexity (linear when using our indexing technique).

MF1 cons: it evaluates the best-matching clusters only, which gives an unfair advantage to the clusterings with larger numbers of clusters (which is partially addressed by the application of the harmonic mean in $F1h$ and $F1p$ instead of the average).

GNMI pros: it parallelizes very well and inherits NMI’s pros, namely it evaluates the full matching of all clusters (unlike MF1). Also, it is well-grounded formally in Information Theory.

GNMI cons: the convergence of the stochastic implementation is not guaranteed (though loss of convergence typically indicates a relevance close to zero); the stochastic process yields non-deterministic results and is computationally heavy; its execution time is hard to estimate. In addition, GNMI inherits NMI’s cons, namely the results depend on the number of clusters being evaluated and increase up to $\approx 0.3$ for large numbers of clusters.
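To make the Omega Index properties listed above concrete (a value of 0 for chance-level agreement and a cost quadratic in the number of nodes), the following minimal sketch follows the standard definition of the original Omega Index from [4]. It is a naive reference illustration rather than the optimized implementation discussed in this paper, it does not cover the Soft Omega Index, and the function names are chosen here for illustration only.

```python
from collections import Counter
from itertools import combinations


def pair_counts(clusters):
    """Count, for each unordered node pair, how many clusters contain both."""
    counts = Counter()
    for cluster in clusters:
        for pair in combinations(sorted(set(cluster)), 2):
            counts[pair] += 1
    return counts


def omega_index(c1, c2):
    """Chance-corrected pair agreement; iterates over all O(N^2) node pairs."""
    nodes = sorted({n for c in c1 for n in c} | {n for c in c2 for n in c})
    total = len(nodes) * (len(nodes) - 1) // 2
    if total == 0:
        raise ValueError("at least two nodes are required")
    p1, p2 = pair_counts(c1), pair_counts(c2)
    agree = 0
    t1, t2 = Counter(), Counter()  # pairs sharing exactly k clusters
    for pair in combinations(nodes, 2):
        k1, k2 = p1.get(pair, 0), p2.get(pair, 0)
        t1[k1] += 1
        t2[k2] += 1
        agree += k1 == k2
    observed = agree / total
    expected = sum(t1[k] * t2[k] for k in t1) / total ** 2
    if expected == 1.0:  # degenerate case: agreement is certain
        return 1.0
    return (observed - expected) / (1.0 - expected)


same = omega_index([[1, 2], [3, 4]], [[1, 2], [3, 4]])
crossed = omega_index([[1, 2], [3, 4]], [[1, 3], [2, 4]])
```

Identical clusterings score 1, while the crossed pairing in the second call scores below 0, that is, worse than the agreement expected by chance.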

Empirical evaluation of the formal constraints satisfaction for all discussed metrics on the original intuitive samples from [3] is given in Table V. As shown in the table, none of the metrics satisfies the RagBag constraint, which is essential in practice when evaluating multi-resolution or hierarchical clusterings, since the more fine-grained clusterings (higher resolutions) are expected to have a less noisy structure. Otherwise, the proposed $F1p$ metric performs best according to the empirical satisfaction of the formal constraints, while having the lowest computational complexity (linear in the number of elements being clustered) compared to GNMI and Omega Index.

V Conclusions

In this paper, we discussed the state-of-the-art accuracy metrics applicable to overlapping clustering evaluations on large datasets and introduced several optimizations and extensions of the discussed metrics. In particular, we introduced an efficient indexing technique to speed up the evaluation of the Mean F1 score from $O(N\sqrt{N})$ to $O(N)$, where $N$ is the number of elements in the clustering. We proposed an adaptive sampling strategy for GNMI, which not only speeds up the evaluation by orders of magnitude but also improves the precision of the metric. In addition, we proposed two extensions of the Average F1 score ($F1a$), namely $F1h$ and $F1p$. $F1h$ addresses the loss of indicativity of $F1a$ in the range $[0,0.5]$, while $F1p$ empirically improves the satisfaction of the formal constraints we consider without sacrificing efficiency. We also proposed an extension of the Omega Index called Soft Omega Index, which is equivalent to the original Omega Index when evaluating non-overlapping clusterings and yields more discriminative results for overlapping clusterings because it considers partial matches of pairs of elements.
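As an illustration of how an index over cluster members avoids comparing every pair of clusters, and of how the harmonic mean in $F1h$ differs from the arithmetic mean in $F1a$, consider the minimal sketch below. It assumes unit membership for every node and a plain inverted index from nodes to clusters; it is not necessarily the indexing technique of our implementation, $F1p$ is not covered, and all function names are hypothetical.

```python
from collections import defaultdict


def best_match_f1(src, dst):
    """Average, over clusters in src, of the best F1 against clusters in dst."""
    if not src:
        raise ValueError("src must contain at least one cluster")
    index = defaultdict(list)  # node -> ids of the dst clusters containing it
    for j, cluster in enumerate(dst):
        for node in set(cluster):
            index[node].append(j)
    dst_sizes = [len(set(cluster)) for cluster in dst]
    total = 0.0
    for cluster in src:
        members = set(cluster)
        overlap = defaultdict(int)  # dst cluster id -> number of shared nodes
        for node in members:
            for j in index.get(node, ()):
                overlap[j] += 1
        # F1 of two sets A and B equals 2|A & B| / (|A| + |B|).
        total += max(
            (2.0 * m / (len(members) + dst_sizes[j]) for j, m in overlap.items()),
            default=0.0,
        )
    return total / len(src)


def f1a(c1, c2):
    """Arithmetic mean of the two directional best-match scores."""
    return (best_match_f1(c1, c2) + best_match_f1(c2, c1)) / 2.0


def f1h(c1, c2):
    """Harmonic mean of the two directional best-match scores."""
    a, b = best_match_f1(c1, c2), best_match_f1(c2, c1)
    return 2.0 * a * b / (a + b) if a + b > 0 else 0.0


formed = [[1, 2, 3, 4]]
truth = [[1, 2, 3, 4], [5, 6]]
score_a, score_h = f1a(formed, truth), f1h(formed, truth)
```

Here the formed clustering misses one ground-truth cluster entirely, so the directional scores are 1 and 0.5; the arithmetic mean yields 0.75, whereas the harmonic mean yields about 0.67 and thus penalizes the mismatch more strongly.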

Besides the proposed optimizations and extensions of the accuracy metrics, we discussed formal constraints for each metric, as well as their applicability and limitations for specific cases in overlapping and multi-resolution clusterings. Our analysis of the metrics should provide insight for their future usage and should help identify the best performing clustering algorithm for a particular user’s needs or tasks.

We freely provide implementations of all discussed metrics, and are also integrating them into an open source benchmarking framework for clustering algorithms evaluation, Clubmark.

Bibliography

[1] M. E. J. Newman and M. Girvan, “Finding and evaluating community structure in networks,” Phys. Rev. E, vol. 69, no. 2, p. 026113, 2004.
[2] R. Kannan, S. Vempala, and A. Vetta, “On clusterings: Good, bad and spectral,” J. ACM, vol. 51, no. 3, pp. 497–515, May 2004.
[3] E. Amigó, J. Gonzalo, J. Artiles, and F. Verdejo, “A comparison of extrinsic clustering evaluation metrics based on formal constraints,” Inf. Retr., vol. 12, no. 4, Aug. 2009.
[4] L. M. Collins and C. W. Dent, “Omega: A general formulation of the rand index of cluster recovery suitable for non-disjoint solutions,” Multivariate Behavioral Research, vol. 23, no. 2, pp. 231–242, 1988.
[5] L. Danon, A. Díaz-Guilera, J. Duch, and A. Arenas, “Comparing community structure identification,” Journal of Statistical Mechanics: Theory and Experiment, vol. 9, p. 8, Sep. 2005.
[6] A. Rosenberg and J. Hirschberg, “V-measure: A conditional entropy-based external cluster evaluation measure,” ser. EMNLP-CoNLL ’07, pp. 410–420.
[7] J. Yang and J. Leskovec, “Overlapping community detection at scale: A nonnegative matrix factorization approach,” ser. WSDM ’13. ACM, pp. 587–596.
[8] H. Rosales-Méndez and Y. Ramírez-Cruz, “Cice-bcubed: A new evaluation measure for overlapping clustering algorithms,” ser. CIARP ’13.