Metrics matter in community detection

Arya D. McCarthy; Tongfei Chen; Rachel Rudinger; David W. Matula

arXiv:1901.01354·cs.SI·May 22, 2020

TL;DR

This paper critically examines the use of normalized mutual information (NMI) for evaluating community detection, highlighting its biases and proposing more robust alternatives like one-sided AMI for fair assessment.

Contribution

It analyzes the limitations of NMI and related metrics, providing equivalences under random models and recommending improved evaluation methods for community detection.

Findings

01 NMI exaggerates performance on weak communities

02 One-sided AMI offers a more robust evaluation metric

03 Different metrics can be equivalent under certain models

Abstract

We present a critical evaluation of normalized mutual information (NMI) as an evaluation metric for community detection. NMI exaggerates the leximin method's performance on weak communities: Does leximin, in finding the trivial singletons clustering, truly outperform eight other community detection methods? Three NMI improvements from the literature are AMI, rrNMI, and cNMI. We show equivalences under relevant random models, and for evaluating community detection, we advise one-sided AMI under the $M_{all}$ model (all partitions of $n$ nodes). This work seeks (1) to start a conversation on robust measurements, and (2) to advocate evaluations which do not give "free lunch".

Equations

$$\mathrm{AMI}(\mathcal{C},\mathcal{T})=\frac{I(\mathcal{C},\mathcal{T})-B(\mathcal{C},\mathcal{T})}{M(\mathcal{C},\mathcal{T})-B(\mathcal{C},\mathcal{T})},$$

$B(\mathcal{C},\mathcal{T})$

$M(\mathcal{C},\mathcal{T})$

$$H(\mathcal{C})=-\sum_{C\in\mathcal{C}}\Pr(C)\log\Pr(C),$$

$$H(\mathcal{C})=-\sum_{C\in\mathcal{C}}\frac{|C|}{N}\log\frac{|C|}{N}.$$

$$I(\mathcal{C},\mathcal{T})=\sum_{C\in\mathcal{C}}\sum_{T\in\mathcal{T}}\frac{|C\cap T|}{N}\log\frac{N\,|C\cap T|}{|C|\,|T|}.$$

$$\mathrm{NMI}(\mathcal{C},\mathcal{T})=\frac{I(\mathcal{C},\mathcal{T})}{M(\mathcal{C},\mathcal{T})}.$$

$$\mathrm{rNMI}(\mathcal{C},\mathcal{T})=\mathrm{NMI}(\mathcal{C},\mathcal{T})-\mathbb{E}_{\mathcal{C}^{\prime}}\left[\mathrm{NMI}(\mathcal{C}^{\prime},\mathcal{T})\right].$$

$$\mathrm{rrNMI}(\mathcal{C},\mathcal{T})=\frac{\mathrm{rNMI}(\mathcal{C},\mathcal{T})}{\mathrm{rNMI}(\mathcal{T},\mathcal{T})}.$$

$$\mathrm{cNMI}(\mathcal{C},\mathcal{T})=\frac{\mathrm{rNMI}(\mathcal{C},\mathcal{T})+\mathrm{rNMI}(\mathcal{T},\mathcal{C})}{\mathrm{rNMI}(\mathcal{C},\mathcal{C})+\mathrm{rNMI}(\mathcal{T},\mathcal{T})}.$$

$$\mathrm{AMI}(\mathcal{C},\mathcal{T})=\frac{I(\mathcal{C},\mathcal{T})-\mathbb{E}_{\mathcal{C}^{\prime},\mathcal{T}^{\prime}}\left[I(\mathcal{C}^{\prime},\mathcal{T}^{\prime})\right]}{\max_{\mathcal{C}^{\prime},\mathcal{T}^{\prime}}I(\mathcal{C}^{\prime},\mathcal{T}^{\prime})-\mathbb{E}_{\mathcal{C}^{\prime},\mathcal{T}^{\prime}}\left[I(\mathcal{C}^{\prime},\mathcal{T}^{\prime})\right]},$$

$$B_{\text{all}}^{1}(\mathcal{C},\mathcal{T})=\mathbb{E}_{\mathcal{C}^{\prime}\sim\mathbb{M}_{\text{all}}(\mathcal{T})}\left[I(\mathcal{C}^{\prime},\mathcal{T})\right].$$

$$M_{\text{all}}(\mathcal{C},\mathcal{T})=\log N.$$

$$\mathrm{AMI}_{\text{all}}^{1}(\mathcal{C},\mathcal{T})=\frac{I(\mathcal{C},\mathcal{T})-\mathbb{E}_{\mathcal{C}^{\prime}}\left[I(\mathcal{C}^{\prime},\mathcal{T})\right]}{\log N-\mathbb{E}_{\mathcal{C}^{\prime}}\left[I(\mathcal{C}^{\prime},\mathcal{T})\right]},$$

$$\mathbb{E}_{\mathcal{C}^{\prime}\sim\mathbb{M}_{\text{perm}}(\mathcal{C})}\left[F(\mathcal{C}^{\prime},\mathcal{T}_{1})\right]=\mathbb{E}_{\mathcal{C}^{\prime}\sim\mathbb{M}_{\text{perm}}(\mathcal{C})}\left[F(\mathcal{C}^{\prime},\mathcal{T}_{2})\right].$$

$$\mathbb{E}_{\mathcal{C}^{\prime}}\left[I(\mathcal{C}^{\prime},\mathcal{T}_{1})\right]=\mathbb{E}_{\mathcal{C}^{\prime}}\left[I(\mathcal{C}^{\prime},\mathcal{T}_{2})\right]=K.$$

$$\mathbb{E}_{\mathcal{C}^{\prime},\mathcal{T}^{\prime}}\left[I(\mathcal{C}^{\prime},\mathcal{T}^{\prime})\right]=\mathbb{E}_{\mathcal{T}^{\prime}}\left[K\right]=K.$$

Full text

Partially conducted at Southern Methodist University.

Metrics matter in community detection

Arya D. McCarthy

Tongfei Chen

Rachel Rudinger

Department of Computer Science

Johns Hopkins University

David W. Matula

Department of Computer Science and Engineering

Southern Methodist University

Abstract

We present a critical evaluation of normalized mutual information (NMI) as an evaluation metric for community detection. NMI exaggerates the leximin method’s performance on weak communities: Does leximin, in finding the trivial singletons clustering, truly outperform eight other community detection methods? Three NMI improvements from the literature are AMI, rrNMI, and cNMI. We show equivalences under relevant random models, and **for evaluating community detection, we advise one-sided AMI under the $\mathbb{M}_{\text{all}}$ model** (all partitions of $n$ nodes). This work seeks (1) to start a conversation on robust measurements, and (2) to advocate evaluations which do not give “free lunch”.

PACS: 89.75.Hc Networks and genealogical trees – 05.10.-a Computational methods in statistical physics and nonlinear dynamics – 02.10.Ox Combinatorics; graph theory – 87.23.Ge Dynamics of social systems

I Introduction

Unsupervised algorithms, like those for community detection (CD), which has been investigated in many disciplines for half a century (Matula, 1977), present a challenge for appraisal. In CD, we circumvent the problems of intrinsic measures by using external evaluation tasks. Practitioners apply CD methods to benchmark graphs containing “ground truth” communities, then compute an agreement measure to determine how well those communities are recovered. These measures differ from typical classification accuracy because there are no specific labels (e.g. no notion of “Cluster 2”), only groups of similar entities.

The popular measure in CD is normalized mutual information (NMI). Its theoretical flaws have been noted (Lai and Nardini, 2016; Peel et al., 2017; Vinh et al., 2010; Zhang et al., 2015). Particularly relevant is the non-homogeneity of the measure: NMI awards credit for low-information guessing (Peel et al., 2017). This deficiency has demonstrable implications for method selection, which we later show using the leximin method as an example.

A sequence of proposed improvements (Zhang, 2015; Zhang et al., 2015) in the CD community led to the recent corrected NMI (cNMI) (Lai and Nardini, 2016). A common, older measure, adjusted mutual information (AMI) (Vinh et al., 2010), has garnered recent attention in CD (McCarthy, 2017; Peel et al., 2017). AMI augments NMI’s consistent upper bound (1.0) with a consistent zero expectation to adjust for chance clusterings (negative AMI indicates worse-than-chance clusterings):

$$\mathrm{AMI}(\mathcal{C},\mathcal{T})=\frac{I(\mathcal{C},\mathcal{T})-B(\mathcal{C},\mathcal{T})}{M(\mathcal{C},\mathcal{T})-B(\mathcal{C},\mathcal{T})}, \tag{1}$$

where $B(\mathcal{C},\mathcal{T})$ is a baseline function that is used to adjust the metric to zero expectation, and $M(\mathcal{C},\mathcal{T})$ guarantees a consistent upper bound of the metric. In previous literature, one common incarnation of these functions is

$$B(\mathcal{C},\mathcal{T})=\mathbb{E}_{\mathcal{C}^{\prime},\mathcal{T}^{\prime}}\left[I(\mathcal{C}^{\prime},\mathcal{T}^{\prime})\right], \tag{2}$$

$$M(\mathcal{C},\mathcal{T})=\sqrt{H(\mathcal{C})\,H(\mathcal{T})}. \tag{3}$$

Many variants exist, by changing the definitions of the $B$ and $M$ functions.

Note the expectation operator in Equation 2. Over what distribution is the expected value computed? Recent work calls such a distribution a random model (Gates and Ahn, 2017). The literature has implicitly computed expectations over $\mathbb{M}_{\text{perm}}$: all partitions of the same class (or cluster-size pattern) as the observation (Zhang, 2015; Zhang et al., 2015; Lai and Nardini, 2016; Vinh et al., 2009, 2010). We argue that an expectation over $\mathbb{M}_{\text{all}}$ (all partitions of $n$ nodes) is more appropriate for CD than one over $\mathbb{M}_{\text{perm}}$. We also advise one-sided random models for comparing against a fixed ground truth.

The main contributions of this work are

• With the leximin method as an example, we show the need for an improved evaluation metric in CD.

• We identify an evaluation function which better matches the community detection problem domain.

• We advocate for the use of the adjusted metric AMI, with slight modifications from its proposed form.

• We provide thorough analysis of the relationships between AMI and other evaluation functions.

II Community Detection

A number of tasks on graphs ask that you partition the graph’s nodes to maximize a score function. Situated between the microscopic, node level and the macroscopic, whole-graph level, these partitions form a mesoscopic structure—be it a core–periphery separation, a graph coloring, or our focus: community detection. Community detection has been historically ill-defined, though the intuition is to collect nodes with high interconnectivity (or edge density) into communities with low edge density between them. The task is analogous to clustering, which groups points that are near one another in some fashioned metric space.

To bring some rigor to the task, we often choose modularity (Newman, 2006a) as the objective function. Modularity balances two objectives: placing many edges within clusters (the first term) and avoiding giant, all-encompassing communities (the second term). Directly maximizing modularity is NP-hard (Brandes et al., 2006), as is approximating it to within any constant factor (Dinh et al., 2015). For this reason, most community detection methods are heuristics that approximately maximize modularity.
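The two terms of modularity are not written out in this section. As a minimal sketch, assuming the standard Newman form $Q=\sum_{c}\left[L_{c}/m-(d_{c}/2m)^{2}\right]$ for an undirected, unweighted graph (with $L_{c}$ the number of edges inside community $c$, $d_{c}$ its total degree, and $m$ the number of edges), the trade-off can be seen on a toy graph. The function name and the example graph are illustrative, not taken from the paper.

```python
from collections import defaultdict


def modularity(edges, communities):
    """Newman modularity of an undirected, unweighted graph (assumed form)."""
    m = len(edges)
    label = {node: i for i, comm in enumerate(communities) for node in comm}
    internal = defaultdict(int)  # edges with both endpoints in community i
    degree = defaultdict(int)    # summed node degree of community i
    for u, v in edges:
        degree[label[u]] += 1
        degree[label[v]] += 1
        if label[u] == label[v]:
            internal[label[u]] += 1
    # First term rewards internal edges; second term penalizes large communities.
    return sum(internal[i] / m - (degree[i] / (2 * m)) ** 2
               for i in range(len(communities)))


# Two triangles joined by a single bridge edge.
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
q_two = modularity(edges, [{0, 1, 2}, {3, 4, 5}])
q_one = modularity(edges, [{0, 1, 2, 3, 4, 5}])
q_singletons = modularity(edges, [{i} for i in range(6)])
print(q_two, q_one, q_singletons)
```

On this graph the two-triangle split scores highest, the single all-encompassing community scores zero, and the singletons partition scores below zero.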

Still, modularity isn’t the be-all, end-all in community detection. In real-world data, we may have a notion of the communities already, and in these cases we only value modularity insofar as it may guide us toward these communities. To assess whether our formulation of community detection matches our needs, we perform extrinsic evaluation against a known ground truth. While our focus is community detection, our results are sufficiently general for any mesoscopic pattern discovery on graphs.

III Preliminaries

III.1 Partition structures

A partition or clustering $\mathcal{C}$ of set $X$ can be written as $\mathcal{C}=\{C_{1},\cdots,C_{k}\}$, where each community or cluster $C_{i}$ is a subset of $X$, their union equals $X$, i.e. $\bigcup_{i=1}^{k}C_{i}=X$, and every pair of distinct clusters is disjoint, i.e., $\forall i\neq j$, $C_{i}\cap C_{j}=\varnothing$.

We reuse some terminology on partition structures from Kingman (1978). A partition $\Lambda$ of integer $N$ is a multiset of integers $\{\lambda_{1},\cdots,\lambda_{k}\}$ whose sum is $N$, commonly arranged in descending order. A partition $\mathcal{C}=\{C_{1},\cdots,C_{k}\}$ of set $X$ is said to have $\mathrm{shape}(\mathcal{C})\triangleq\{|C_{1}|,\cdots,|C_{k}|\}$ (a multiset), where $\sum_{i=1}^{k}|C_{i}|=|X|=N$. That is, the shape is an integer partition whose elements map to cluster sizes. (This shape has alternatively been called a class or decomposition pattern (Hauer and Kondrak, 2016).)
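The definitions of a partition and its shape translate directly into code. The sketch below is illustrative only: the helper names `is_partition` and `shape` are our own, and the requirement that clusters be non-empty is a conventional assumption not stated in the text.

```python
def is_partition(clusters, universe):
    """Return True if `clusters` is a partition of `universe`."""
    seen = set()
    for cluster in clusters:
        cluster = set(cluster)
        # Assumption: empty clusters are disallowed, as is conventional.
        if not cluster or seen & cluster:
            return False
        seen |= cluster
    return seen == set(universe)


def shape(clusters):
    """Cluster sizes in descending order: an integer partition of N."""
    return tuple(sorted((len(c) for c in clusters), reverse=True))


X = range(6)
C = [{0, 1, 2}, {3, 4}, {5}]
print(is_partition(C, X), shape(C))
```

Representing the shape as a sorted tuple makes two clusterings with the same multiset of sizes compare equal, which is what the permutation model in the next subsection relies on.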

III.2 Random models

And what, in fact, is the domain of $\mathcal{C}$ , which we call $\mathbb{M}$ ? For community detection, it will be the set of all partitions on $N$ nodes. After all, any of these partitions represents a valid community structure, even if it is a poor community structure by intrinsic metrics like modularity. Nevertheless, other random models are relevant to our discussion, restricting the set to those with a fixed number of communities or with a fixed sequence of sizes.

Previous work argued that clustering similarity should be computed in the context of a random ensemble of clusterings Vinh et al. (2009); Hubert and Arabie (1985). What context of clusterings should be chosen? Gates and Ahn (2017) argue that the question above is usually ignored in CD research and more broadly. To remedy this, they proposed three random models: $\mathbb{M}_{\text{perm}}$ , $\mathbb{M}_{\text{all}}$ and $\mathbb{M}_{\text{num}}$ . Given a clustering $\mathcal{C}$ over set $X$ whose size is $N$ , these random models are:

1. $\mathbb{M}_{\text{all}}$: This is the random model that spans all clusterings of $N$ elements: $\mathbb{M}_{\text{all}}(\mathcal{C})=\{\mathcal{C}^{\prime}\mid\sum_{C\in\mathcal{C}^{\prime}}|C|=N\}$. This is the random model that we advocate instead of the more common $\mathbb{M}_{\text{perm}}$; our justifications will be elaborated below.

2. $\mathbb{M}_{\text{perm}}$ (permutation model): The shape of the clusterings is fixed, and all random clusterings are generated by shuffling the elements between the fixed-size clusters. Formally, $\mathbb{M}_{\text{perm}}(\mathcal{C})=\{\mathcal{C}^{\prime}\mid{\rm shape}(\mathcal{C}^{\prime})={\rm shape}(\mathcal{C})\}$. However, despite being widely used for evaluation (Zhang, 2015; Zhang et al., 2015; Lai and Nardini, 2016; Vinh et al., 2009, 2010), the premises of the permutation model are frequently violated; in many clustering scenarios, either the number of clusters or the size distribution varies drastically (Hubert and Arabie, 1985; Gates and Ahn, 2017).

3. $\mathbb{M}_{\text{num}}$: The random model that contains all clusterings with the same number of clusters: $\mathbb{M}_{\text{num}}(\mathcal{C})=\{\mathcal{C}^{\prime}\in\mathbb{M}_{\text{all}}(\mathcal{C})\mid|\mathcal{C}^{\prime}|=|\mathcal{C}|\}$.

These sets of clusterings satisfy the following containment order: $\mathbb{M}_{\text{perm}}(\mathcal{C})\subseteq\mathbb{M}_{\text{num}}(\mathcal{C})\subseteq\mathbb{M}_{\text{all}}(\mathcal{C})$. (This abuse of notation blurs the distinction between random models and the spaces over which they define their probabilities. We have restricted ourselves to uniform distributions over the spaces, so we comfortably believe that the abuse will not confuse.)
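For very small $N$ the three random models can be enumerated outright, which makes the containment order concrete. This is a brute-force sketch for intuition only (the number of partitions grows as the Bell numbers); the function names are hypothetical.

```python
def set_partitions(items):
    """Yield every partition of `items` as a list of lists."""
    if not items:
        yield []
        return
    first, rest = items[0], items[1:]
    for partial in set_partitions(rest):
        # Put `first` into each existing block in turn ...
        for i in range(len(partial)):
            yield partial[:i] + [[first] + partial[i]] + partial[i + 1:]
        # ... or into a new block of its own.
        yield [[first]] + partial


def shape(clusters):
    return tuple(sorted((len(c) for c in clusters), reverse=True))


def m_all(n):
    return list(set_partitions(list(range(n))))


def m_num(clustering, n):
    return [p for p in m_all(n) if len(p) == len(clustering)]


def m_perm(clustering, n):
    return [p for p in m_all(n) if shape(p) == shape(clustering)]


C = [[0, 1], [2, 3]]
sizes = (len(m_perm(C, 4)), len(m_num(C, 4)), len(m_all(4)))
print(sizes)  # (3, 7, 15)
```

Even at $N=4$, the permutation model covers only 3 of the 15 possible clusterings for this shape.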

III.3 Information theory

After a method has followed clues through the space of partitions, we want to pull off its blindfold to see how it did. Classification accuracy or $F_{1}$ score won’t cut it: When comparing to the ground truth, there are no specific labels (e.g. no notion of “Cluster 2”)—only groups of like entities. We settle for a measure of similarity in the groupings, quantifying how much the computed partition tells us about the ground truth. A popular choice is the normalized mutual information (NMI) between the prediction and the ground truth. To understand it (and its flaws), we must review basic information theory concepts.

Mutual information depends on an understanding of entropy, which captures the uncertainty in a random variable—in this case, the category labels. For a clustering $\mathcal{C}=\{C_{1},C_{2},\dots,C_{k}\}$ , the entropy is

$$H(\mathcal{C})=-\sum_{C\in\mathcal{C}}\Pr(C)\log\Pr(C). \tag{4}$$

For community detection, we compute it with maximum likelihood estimation. The probability of membership in a given community is proportional to its size, so the clustering’s entropy—a measure of its uncertainty—is

$$H(\mathcal{C})=-\sum_{C\in\mathcal{C}}\frac{|C|}{N}\log\frac{|C|}{N}. \tag{5}$$

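Mutual information (MI) measures how well knowing one distribution shrinks our uncertainty about another. Again using maximum likelihood estimation, the MI between a clustering $\mathcal{C}$ and its “ground truth” clustering $\mathcal{T}$ is

$$I(\mathcal{C},\mathcal{T})=\sum_{C\in\mathcal{C}}\sum_{T\in\mathcal{T}}\frac{|C\cap T|}{N}\log\frac{N\,|C\cap T|}{|C|\,|T|}. \tag{6}$$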

It can be understood as the Kullback–Leibler divergence from the joint distribution $\Pr(C_{c},T_{t})$ to the product of its marginal distributions $\Pr(C_{c})\Pr(T_{t})$ . Like entropy, the value is nonnegative. It can be normalized by dividing by an upper bound. We will contrast the choices for this bound in section V.
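The maximum likelihood estimates of entropy and mutual information for clusterings reduce to counting set sizes and overlaps. The following sketch uses natural logarithms (nats) and represents each clustering as a list of Python sets; both choices, and the function names, are illustrative assumptions.

```python
from math import log


def entropy(clustering, n):
    """Entropy of a clustering of n nodes, using cluster-size frequencies."""
    return -sum(len(c) / n * log(len(c) / n) for c in clustering)


def mutual_information(clustering, truth, n):
    """Mutual information between two clusterings of the same n nodes."""
    total = 0.0
    for c in clustering:
        for t in truth:
            overlap = len(c & t)
            if overlap:  # empty overlaps contribute nothing
                total += overlap / n * log(n * overlap / (len(c) * len(t)))
    return total


T = [{0, 1, 2, 3}, {4, 5, 6, 7}]
C_same = [{0, 1, 2, 3}, {4, 5, 6, 7}]
C_cross = [{0, 1, 4, 5}, {2, 3, 6, 7}]

print(entropy(T, 8))                      # log 2
print(mutual_information(C_same, T, 8))   # log 2: C_same determines T
print(mutual_information(C_cross, T, 8))  # 0.0: C_cross says nothing about T
```

A clustering identical to the truth attains $I(\mathcal{T},\mathcal{T})=H(\mathcal{T})$, while one that cuts evenly across the true communities carries no information about them.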

IV What We’ve Been Doing Wrong

Community detection is historically evaluated with NMI:

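$$\mathrm{NMI}(\mathcal{C},\mathcal{T})=\frac{I(\mathcal{C},\mathcal{T})}{M(\mathcal{C},\mathcal{T})}. \tag{7}$$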

The quantity is unitless: by normalizing, we divide nats by nats or bits by bits, based on our choice of logarithm. Vinh et al. (2009) and Zhang (2015) both note a finite size effect: the average score creeps upward with the number of predicted clusters, regardless of the true number. This biases results toward the prediction of a large number of clusters, which is a danger to adequate CD evaluation. A related flaw noted by Peel et al. (2017) is that the measure is not homogeneous. Much as the center of a circle is closest on average to each point in it, the trivial partition into singleton communities scores highest under NMI when averaged over all possible ground truths. As a homogeneous measure is a precondition of the No Free Lunch theorem (Wolpert, 1996), NMI in fact awards “free lunch” when guessing the singletons partition. We will later show practical consequences of this deficiency.
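A small numerical case illustrates the credit NMI gives to the singletons partition. The sketch assumes the geometric-mean normalizer $M(\mathcal{C},\mathcal{T})=\sqrt{H(\mathcal{C})H(\mathcal{T})}$; other normalizers change the numbers but not the qualitative picture for this example. The toy clusterings and function names are our own.

```python
from math import log, sqrt


def entropy(clustering, n):
    return -sum(len(c) / n * log(len(c) / n) for c in clustering)


def mutual_information(clustering, truth, n):
    total = 0.0
    for c in clustering:
        for t in truth:
            overlap = len(c & t)
            if overlap:
                total += overlap / n * log(n * overlap / (len(c) * len(t)))
    return total


def nmi(clustering, truth, n):
    # Assumption: geometric mean of the two entropies as the upper bound.
    bound = sqrt(entropy(clustering, n) * entropy(truth, n))
    return mutual_information(clustering, truth, n) / bound if bound else 0.0


T = [{0, 1, 2, 3}, {4, 5, 6, 7}]
singletons = [{i} for i in range(8)]
noisy = [{0, 1, 2, 4}, {3, 5, 6, 7}]  # 6 of 8 nodes grouped as in T

nmi_singletons = nmi(singletons, T, 8)  # roughly 0.58
nmi_noisy = nmi(noisy, T, 8)            # roughly 0.19
print(nmi_singletons, nmi_noisy)
```

Here the singletons partition, which recovers no community structure, scores about three times higher than a clustering that places six of the eight nodes correctly.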

V Recipe for Proper Evaluation

Our recommended solution incorporates both practical scoring concerns and an improvement in probability notions over the de facto preference. Desires for an adjustment for chance to give a constant baseline and a constant top score are longstanding, as shown by the popular adjusted Rand index (Hubert and Arabie, 1985). The final recommendation is the adjusted mutual information (AMI), computing its expectation with the one-sided random model $\mathbb{M}^{1}_{\text{all}}$ .

To be useful to practitioners, an evaluation measure for community detection should have the properties of a constant baseline and a constant top score.

Constant baseline

In the statistical sense, we would like a consistent baseline: A random guess should merit no credit. This differs from the notion of a baseline in NMI, which is a lower bound. With this consistent baseline, we eliminate the “free lunch”. The proposal of Zhang (2015), relative normalized mutual information ( $\mathrm{rNMI}$ ), improves NMI by subtracting the expected NMI for this ground truth, such that a random guess garners a score of 0. In $\mathrm{rNMI}$ the expected NMI is computed under the permutation model ( $\mathcal{C}^{\prime}\sim\mathbb{M}_{\text{perm}}(\mathcal{C})$ ):

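$$\mathrm{rNMI}(\mathcal{C},\mathcal{T})=\mathrm{NMI}(\mathcal{C},\mathcal{T})-\mathbb{E}_{\mathcal{C}^{\prime}}\left[\mathrm{NMI}(\mathcal{C}^{\prime},\mathcal{T})\right]. \tag{8}$$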

Constant top score

The flaw of rNMI is that now only one reference point (the expectation) is fixed across clusters, not two like in NMI. This means that we can’t compare performance across clusterings of different sizes, and we don’t know whether we’ve succeeded by attaining the maximum value. Zhang et al. (2015) renormalize rNMI to create renormalized relative normalized mutual information ( $\mathrm{rrNMI}$ ):

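$$\mathrm{rrNMI}(\mathcal{C},\mathcal{T})=\frac{\mathrm{rNMI}(\mathcal{C},\mathcal{T})}{\mathrm{rNMI}(\mathcal{T},\mathcal{T})}. \tag{9}$$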

Now, we have both a constant expectation and a constant ceiling, remedying the flaw we’ve noted.
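The effect of the two corrections can be checked by brute force on a tiny example. This sketch computes the permutation-model expectation by enumerating every partition with the same shape as the prediction, rather than with any closed form, and again assumes the geometric-mean normalizer for NMI. All function names are hypothetical.

```python
from math import log, sqrt


def set_partitions(items):
    """Yield every partition of `items` as a list of lists."""
    if not items:
        yield []
        return
    first, rest = items[0], items[1:]
    for partial in set_partitions(rest):
        for i in range(len(partial)):
            yield partial[:i] + [[first] + partial[i]] + partial[i + 1:]
        yield [[first]] + partial


def shape(clusters):
    return tuple(sorted((len(c) for c in clusters), reverse=True))


def entropy(clusters, n):
    return -sum(len(c) / n * log(len(c) / n) for c in clusters)


def mutual_information(a, b, n):
    total = 0.0
    for c in a:
        for t in b:
            overlap = len(set(c) & set(t))
            if overlap:
                total += overlap / n * log(n * overlap / (len(c) * len(t)))
    return total


def nmi(a, b, n):
    bound = sqrt(entropy(a, n) * entropy(b, n))  # assumed normalizer
    return mutual_information(a, b, n) / bound if bound else 0.0


def rnmi(c, t, n):
    """NMI minus its mean over all partitions with the same shape as c."""
    same_shape = [p for p in set_partitions(list(range(n))) if shape(p) == shape(c)]
    expected = sum(nmi(p, t, n) for p in same_shape) / len(same_shape)
    return nmi(c, t, n) - expected


def rrnmi(c, t, n):
    # Undefined when rnmi(t, t, n) is zero, e.g. for a singletons ground truth.
    return rnmi(c, t, n) / rnmi(t, t, n)


T = [[0, 1, 2], [3, 4, 5]]
print(rnmi(T, T, 6), rrnmi(T, T, 6))
```

For this ground truth, a perfect prediction earns an rNMI below 1 because the baseline has been subtracted, and rrNMI restores the top score to exactly 1.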

Symmetry

While we take the controversial stance that symmetry in the measure is undesirable, discussing it is necessary to contextualize additional work and to show the connections between methods in section VI. Note that rrNMI’s denominator depends only on $\mathcal{T}$. The information of one distribution about another, though, is symmetric. Lai and Nardini (2016) symmetrize rrNMI to create the corrected normalized mutual information ($\mathrm{cNMI}$):

$$\mathrm{cNMI}(\mathcal{C},\mathcal{T})=\frac{\mathrm{rNMI}(\mathcal{C},\mathcal{T})+\mathrm{rNMI}(\mathcal{T},\mathcal{C})}{\mathrm{rNMI}(\mathcal{C},\mathcal{C})+\mathrm{rNMI}(\mathcal{T},\mathcal{T})}. \tag{10}$$

Another measure, adjusted mutual information (AMI) (Vinh et al., 2009, 2010), was devised years earlier and incorporates all of these fixes in the style of the adjusted Rand index (Hubert and Arabie, 1985):

$$\mathrm{AMI}(\mathcal{C},\mathcal{T})=\frac{I(\mathcal{C},\mathcal{T})-\mathbb{E}_{\mathcal{C}^{\prime},\mathcal{T}^{\prime}}\left[I(\mathcal{C}^{\prime},\mathcal{T}^{\prime})\right]}{\max_{\mathcal{C}^{\prime},\mathcal{T}^{\prime}}I(\mathcal{C}^{\prime},\mathcal{T}^{\prime})-\mathbb{E}_{\mathcal{C}^{\prime},\mathcal{T}^{\prime}}\left[I(\mathcal{C}^{\prime},\mathcal{T}^{\prime})\right]}, \tag{11}$$

where the variables $\mathcal{C}^{\prime}\sim\mathbb{M}_{\text{perm}}(\mathcal{C})$ and $\mathcal{T}^{\prime}\sim\mathbb{M}_{\text{perm}}(\mathcal{T})$ under the $\max$ and $\mathbb{E}$ operators above. (The bound function in this AMI definition is $M(\mathcal{C},\mathcal{T})=\max_{\mathcal{C}^{\prime},\mathcal{T}^{\prime}}I(\mathcal{C}^{\prime},\mathcal{T}^{\prime})$. In practice we could use any of the upper bounds described in Gates and Ahn (2017), for example Equation 3, as long as it is an upper bound consistent with the chosen random model, here $\mathbb{M}_{\text{perm}}$.)

V.1 Probability improvements

Adjustment considers all relevant partitions

All of NMI’s successors share a fault. Their expectations are computed by incorrectly considering only partitions with the same decomposition pattern, or bag of cluster sizes. These aren’t the only partitions that could exist, though; the space of options is larger. A community detection algorithm could partition two nodes into either $\{1\},\{2\}$ or $\{1,2\}$ , and ignoring that fact makes assumptions about the structure of the possibility space. We should prefer $\mathbb{M}_{\text{all}}$ . Fortunately, simple closed-form approximations of its expectation are known (Gates and Ahn, 2017).

Adjustment recognizes that the ground truth is fixed.

Fixing the first fault in our symmetric measures leaves a second fault. We compute the expectation over all clusterings and all truths in the baseline $B(\mathcal{C},\mathcal{T})$ . But the truth for our problem is fixed. A different truth means we’re solving a different problem—core–periphery, etc. Our expectations should only ever be over the values of $\mathcal{C}$ :

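$$B_{\text{all}}^{1}(\mathcal{C},\mathcal{T})=\mathbb{E}_{\mathcal{C}^{\prime}\sim\mathbb{M}_{\text{all}}(\mathcal{T})}\left[I(\mathcal{C}^{\prime},\mathcal{T})\right]. \tag{12}$$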

This too is easy to plug into our formulas, as straightforward closed-form approximations again exist (Gates and Ahn, 2017).

The textbook AMI (before these probability improvements) is known to be asymptotically homogeneous, imparting negligible advantage for weak guesses in the limit as $N$ approaches infinity (Peel et al., 2017). By correcting for chance properly, we can make this asymptotically diminishing advantage exactly zero (McCarthy, 2018).

V.2 Bonus: Generalized mean

There’s a final component to our recommendation: How do we compute the upper bound function $M(\mathcal{C},\mathcal{T})$? Traditionally, different generalized means of the cluster entropies have been used (e.g., Equation 3 uses the geometric mean) (Vinh et al., 2010). Yang et al. (2016) have discussed that the particular choice of generalized mean is unimportant. Fortunately, the problem is simpler in our case: Gates and Ahn (2017) discuss that in $\mathbb{M}_{\text{all}}$, the bounding entropy for either cluster is $\log N$, and any generalized mean of $\log N$ and $\log N$ is $\log N$ (by the generalized mean inequality). Hence we define our upper bound function in $\mathbb{M}_{\text{all}}$ to be

$$M_{\text{all}}(\mathcal{C},\mathcal{T})=\log N. \tag{13}$$

VI Relationships Between Measures

Having waded through the alphabet soup of information-theoretic measures, we now present our major theoretical result: explorations of interrelations between measures. While we began this with our discussion of generalized means, these relationships exploit the random models and sidedness from section V.

The relationships are summarized in Figure 2. To characterize these relationships, we rely on the notion of function specialization: limiting a function’s expressiveness by fixing some parameters, altering its behavior (Young et al., 2018; Veldt et al., 2018). As the simplest example, one-sided measures in general specialize their two-sided counterparts by fixing the ground truth $\mathcal{T}$ , rather than taking an expectation over $\mathcal{T}$ ’s universe.

Our first finding is that in $\mathbb{M}_{\text{perm}}$, AMI specializes itself: One-sided and two-sided AMI are equivalent. We demonstrate this using the exchangeability of partitions under permutation in Appendix A. For a graphical intuition, we refer the reader to Figure 1.

Building on this, we also show that rrNMI (an inherently one-sided, asymmetric measure) specializes AMI in $\mathbb{M}_{\text{perm}}$ . We exploit a clever form of $1$ : Dividing the numerator and denominator of rrNMI both by the maximum MI renders the expression identical to one-sided AMI. This does not generalize to other random models because it relies on the exchangeability of partitions.

Oddly, rrNMI also specializes cNMI into one-sided cNMI in all random models. This is a straightforward algebraic result. It becomes clear by expressing both measures in terms only of mutual information and its expectation.

This may lead the reader to suspect that cNMI and AMI are equivalent, at least under some random model. Unfortunately, under all three of the Gates and Ahn (2017) models, there are irremediable differences. In $\mathbb{M}_{\text{perm}}$ , the denominators compute expectations over different sets. In $\mathbb{M}_{\text{all}}$ and $\mathbb{M}_{\text{num}}$ , the lack of exchangeability renders the numerators distinct.

As a final conceptual introduction, we review the mediant, or “freshman sum”, of two fractions. It is given by separately adding the numerators and denominators of two fractions. It has the property that it lies between its two arguments. The cNMI symmetrizes rrNMI as the mediant of each one-sided measure, so we know that the cNMI is always bounded between the one-sided values. Finally, because the numerators are identical in each one-sided variant, the mediant is also the harmonic mean; this can be shown through simple algebraic manipulation.
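The two properties of the mediant used here are easy to verify with exact rational arithmetic. The sketch below assumes positive denominators; the helper names are our own.

```python
from fractions import Fraction


def mediant(a, b, c, d):
    """Mediant ("freshman sum") of a/b and c/d: (a + c) / (b + d)."""
    return Fraction(a + c, b + d)


def harmonic_mean(x, y):
    return 2 / (1 / x + 1 / y)


# Identical numerators, as in the two one-sided measures.
left, right = Fraction(1, 3), Fraction(1, 2)
between = mediant(1, 3, 1, 2)
print(left, between, right)            # 1/3 2/5 1/2
print(harmonic_mean(left, right))      # 2/5

# With different numerators the mediant is still in between,
# but it no longer equals the harmonic mean.
print(mediant(1, 2, 3, 4), harmonic_mean(Fraction(1, 2), Fraction(3, 4)))  # 2/3 3/5
```

With a shared numerator $a$, both the mediant and the harmonic mean of $a/b$ and $a/d$ simplify to $2a/(b+d)$, which is the algebraic manipulation the text alludes to.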

Our recommended measure, one-sided AMI in $\mathbb{M}_{\text{all}}$ , is identical to rrNMI in $\mathbb{M}_{\text{all}}$ . Though rrNMI is already closer to an appropriate community detection evaluation, we follow parsimony and historical precedent to use the name AMI.

Therefore our recommended measure, the one-sided AMI in $\mathbb{M}_{\text{all}}$ , by combining our derived baseline function $B_{\rm all}^{1}$ in Equation 12 and upper bound function $M_{\rm all}$ in Equation 13, could be formally written as

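$$\mathrm{AMI}_{\text{all}}^{1}(\mathcal{C},\mathcal{T})=\frac{I(\mathcal{C},\mathcal{T})-\mathbb{E}_{\mathcal{C}^{\prime}}\left[I(\mathcal{C}^{\prime},\mathcal{T})\right]}{\log N-\mathbb{E}_{\mathcal{C}^{\prime}}\left[I(\mathcal{C}^{\prime},\mathcal{T})\right]}, \tag{14}$$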

where $\mathcal{C}^{\prime}\sim\mathbb{M}_{\text{all}}(\mathcal{T})=\mathbb{M}_{\text{all}}(\mathcal{C})$ .
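For intuition, the recommended measure can be evaluated exactly on a very small graph by enumerating every partition of the nodes. This is a brute-force sketch under the uniform distribution over $\mathbb{M}_{\text{all}}$; it is not the closed-form approximation of Gates and Ahn (2017) and is feasible only for a handful of nodes. Function names are hypothetical.

```python
from math import log


def set_partitions(items):
    """Yield every partition of `items` as a list of lists."""
    if not items:
        yield []
        return
    first, rest = items[0], items[1:]
    for partial in set_partitions(rest):
        for i in range(len(partial)):
            yield partial[:i] + [[first] + partial[i]] + partial[i + 1:]
        yield [[first]] + partial


def mutual_information(a, b, n):
    total = 0.0
    for c in a:
        for t in b:
            overlap = len(set(c) & set(t))
            if overlap:
                total += overlap / n * log(n * overlap / (len(c) * len(t)))
    return total


def baseline_all(truth, n):
    """Mean of I(C', T) with C' uniform over all partitions of n nodes."""
    scores = [mutual_information(p, truth, n)
              for p in set_partitions(list(range(n)))]
    return sum(scores) / len(scores)


def one_sided_ami_all(clustering, truth, n):
    b = baseline_all(truth, n)
    return (mutual_information(clustering, truth, n) - b) / (log(n) - b)


T = [[0, 1, 2], [3, 4, 5]]
score_truth = one_sided_ami_all(T, T, 6)
score_merged = one_sided_ami_all([[0, 1, 2, 3, 4, 5]], T, 6)
print(score_truth, score_merged)
```

By construction, the score averaged over every partition in $\mathbb{M}_{\text{all}}$ is zero for a fixed ground truth, so a clustering scores above zero only if it is more informative about $\mathcal{T}$ than a uniformly random partition.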

VII A View to a Trap

Having given the design of a fitting extrinsic evaluation, we now extol its need. Community detection methods can be viewed, at their core, simply as efficient heuristics for maximizing modularity while remaining tractable. In choosing a method, we first select graphs which we believe represent the distribution of graphs in our use-case. We then apply our methods to the graphs, scoring their predictions against the ground truth and choosing the method with best performance. (We may also factor running time into our decision—a cost-aware objective.) But NMI’s scores are misleading, making it a danger to our method selection process.

To showcase the danger, we set and spring a trap for NMI. We use a community detection method with a unique architecture, making it susceptible to choosing trivial clusterings. This elicits NMI’s pathology: NMI exaggerates the method’s performance compared to other methods, awarding credit to the trivial clustering. AMI corrects this exaggeration.

VIII The Trap: Leximin, An Adversarial Method

The leximin method is a divisive clustering method, motivated by congestion in traffic networks. It formulates community detection as a hierarchical version of the sparsest cut problem, an objective related to modularity. (Notably, both reduce to MAX-CUT (Matula and Shahrokhi, 1990), making them NP-hard. Veldt et al. (2018) show that both objectives specialize an underlying function.)

The sequence of cuts comes from a linear programming (LP) relaxation of the sparsest cut problem: the maximum concurrent flow problem (Shahrokhi and Matula, 1990). The model routes traffic flow between all pairs of nodes, with an objective that minimizes congestion while fairly satisfying demand Shahrokhi and Matula (1990). (In applications, the traffic could be goods in a commodity distribution network, gossip in a social network, or signals in a neural network.) When rerouting cannot avoid congestion, the saturated edges define a sparse “bottleneck cut”. Continuing to increase the allocation of flow gives a sequence of bottleneck cuts which dissect the graph Matula and Shahrokhi (1990).

Or at least, that would be nice. Lamentably, the tractability comes from weak duality between hierarchical sparsest cut and the hierarchical maximum concurrent flow problem. The relaxation will either find the sparsest cut or a multipart cut (a grid) when the maximum throughput is less than the density of any cut. This is consistent with LPs being in the complexity class P while sparsest cut is NP-hard. The competing forces of cut density and the gridlock bound mean that the graph may splinter into single-node communities without any mesoscopic structure in between. We exploit the fact that complete gridlock becomes reliable for graphs of certain structural properties examined below, as shown in Figure 3.

IX Springing the Trap: Experiment

With the trap in place, we now spring it with an experiment of the fashion described above. We benchmark the NMI and AMI of eight popular CD methods, plus the adversarial case: the leximin method. As our graph distribution, we chose the Lancichinetti–Fortunato–Radicchi (LFR) benchmark graphs, which manifest a ground truth and obey properties of real-world networks (Lancichinetti et al., 2008).

Following Yang et al. (2016), we test $25$ LFR realizations for each combination of parameters $N\in\{80,100,\dots,240\}$ and $\mu\in\{0.03,0.09,\dots,0.75\}$ . Here, $\mu$ is the mixing parameter. It controls the fraction of each node’s edges that connect outside of its community, and it can be thought of as a knob to increase the difficulty of the community detection task. All other parameters were set as in Yang et al. (2016). We report the mean and standard deviation of each score for each combination of parameters.
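The size of the experimental grid follows from the parameter ranges above. The sketch below assumes constant steps of 20 for $N$ and 0.06 for $\mu$, inferred from the first two values listed in each range; the variable names and the use of a realization index as a seed are illustrative.

```python
from itertools import product

REALIZATIONS = 25
sizes = list(range(80, 241, 20))                          # N = 80, 100, ..., 240
mixings = [round(0.03 + 0.06 * k, 2) for k in range(13)]  # mu = 0.03, 0.09, ..., 0.75

# One entry per LFR graph to generate: (N, mu, realization index).
grid = [(n, mu, seed)
        for n, mu in product(sizes, mixings)
        for seed in range(REALIZATIONS)]

print(len(sizes), len(mixings), len(grid))  # 9 13 2925
```

Under these assumed step sizes there are 117 parameter combinations and 2925 benchmark graphs in total.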

The eight methods we compare against are: edge betweenness, Fastgreedy (Clauset et al., 2004), Infomap (Rosvall and Bergstrom, 2007; Rosvall et al., 2009), label propagation (Raghavan et al., 2007), leading eigenvector (Newman, 2006b), multilevel (Blondel et al., 2008), spinglass (Reichardt and Bornholdt, 2006), and walktrap (Pons and Latapy, 2005). For a concise description of each method, the reader is directed to Yang et al. (2016).

Implementation details

The leximin method is implemented in AMPL Fourer et al. (1990). While the hierarchical MCFP could be expressed as a single LP, we exploit the lexicographic problem structure to decompose the problem into a sequence of $N-1$ smaller, subordinate LPs (Podinovskii, 1972). Using AMPL’s CPLEX backend, the method solves the LP of Dong et al.’s triples formulation of the MCFP Dong et al. (2015). The eight other algorithms are implemented in the igraph package, accessed through igraph’s R interface. Determining optimal modularity is built-in for the igraph algorithms; it is done using the networkx Python library Hagberg et al. (2008) for leximin. Cluster evaluation is done for all using the NMI and AMI functions of the scikit-learn Python package Pedregosa et al. (2011).

X Postmortem of a Trapped Measure: Discussion and Results

The most useful graph characteristic to capture the performance of a community detection algorithm is the strength of its communities (Yang et al., 2016). One proposed distinction of communities is into “strong” and “weak” (Radicchi et al., 2004). Strong communities are tight-knit: their nodes all have more edges to each other (intra-community) than to other communities’ nodes (inter-community). In the LFR generator, the strength of communities is controlled en masse by the mixing coefficient, $\mu$ . Values closer to $1$ indicate a harder problem, edging us toward a detectability limit beyond which no method (real or hypothetical) can identify communities.

We present the mean and standard deviation of our scores in Figure 4. Those graphs which took longer than $3$ hours to process are excluded, though all graphs with $n\geq 180$ and $\mu\geq 0.5$ were processed within this bound. The surprising fact that larger graphs were processed faster comes from our column generation scheme: When gridlock splinters the graph into singletons at an early stage, we’ve reached our optimum, and the remaining $O(N)$ LPs need not be solved—a tremendous reduction in computational labor.

Cranking the mixing coefficient $\mu$ higher makes communities weaker. Our tests revealed, as expected, that the leximin method’s NMI begins to surpass the others, staying high when they fall. Infomap and Label Propagation plunge down to 0, while Leading Eigenvector and Fastgreedy have gradual downward slopes, never performing nearly as well as the others. More typically, Multilevel, Spinglass, and Walktrap remain high while communities are strong, then fall near the boundary of $\mu=0.5$ .

These results, though, are very misleading. Remember from Figure 3 that gridlock occurs almost consistently when we cross the threshold into weak communities. This means that no mesoscopic structure is detected. Why should this be scored higher than methods that extract some of the embedded patterns?

When partitions are scored using AMI, most methods’ performance profiles are unchanged. The Louvain (multilevel) method and Infomap, for instance, continue to perform strongly on strong communities and drop off at the same values of $\mu$ . The edge betweenness algorithm’s performance drops faster as $\mu\rightarrow 1$ , but its trend is similar to the trend under NMI. The same is true for the Fastgreedy method. By contrast, AMI gives a completely different characterization of the leximin method’s performance. Against the backdrop of this relative consistency, we see a dramatic drop in the performance of the leximin method, lining up with the increased rate of predicting singleton clusterings. Leximin’s score for high $\mu$ is zero. This is a fairer assessment of the methods’ ability, matching our intuition about how good one clustering should be compared to another. In particular, the singleton clustering finds no community structure, so we should regard it as a failure when community structure does exist.
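The contrast between the two measures on a singleton clustering can be reproduced on a toy example. The sketch below is a standalone illustration rather than the evaluation code used in the experiments: it assumes an arithmetic-mean normalizer and the standard two-sided permutation-model adjustment (not the one-sided $\mathbb{M}_{\mathrm{all}}$ variant advised in this paper), and the label vectors are hypothetical.

```python
from collections import Counter
from math import exp, lgamma, log


def entropy(labels):
    n = len(labels)
    return -sum(c / n * log(c / n) for c in Counter(labels).values())


def mutual_information(a, b):
    n = len(a)
    size_a, size_b = Counter(a), Counter(b)
    return sum(
        nij / n * log(n * nij / (size_a[i] * size_b[j]))
        for (i, j), nij in Counter(zip(a, b)).items()
    )


def expected_mi(a, b):
    """Expected MI when both cluster-size profiles are fixed (permutation model)."""
    n = len(a)
    total = 0.0
    for ai in Counter(a).values():
        for bj in Counter(b).values():
            for nij in range(max(1, ai + bj - n), min(ai, bj) + 1):
                # Log of the hypergeometric probability of an overlap of size nij.
                log_p = (
                    lgamma(ai + 1) + lgamma(bj + 1)
                    + lgamma(n - ai + 1) + lgamma(n - bj + 1)
                    - lgamma(n + 1) - lgamma(nij + 1)
                    - lgamma(ai - nij + 1) - lgamma(bj - nij + 1)
                    - lgamma(n - ai - bj + nij + 1)
                )
                total += nij / n * log(n * nij / (ai * bj)) * exp(log_p)
    return total


def nmi(a, b):
    return mutual_information(a, b) / ((entropy(a) + entropy(b)) / 2)


def ami(a, b):
    emi = expected_mi(a, b)
    upper = (entropy(a) + entropy(b)) / 2  # assumed arithmetic-mean bound
    return (mutual_information(a, b) - emi) / (upper - emi)


truth = [0] * 4 + [1] * 4 + [2] * 4   # three planted communities
singletons = list(range(12))          # every node in its own cluster
merged = [0] * 8 + [1] * 4            # two of the communities merged

nmi_singletons, ami_singletons = nmi(truth, singletons), ami(truth, singletons)
ami_merged = ami(truth, merged)
print(nmi_singletons, ami_singletons, ami_merged)
```

On this input the singleton clustering earns an NMI of about 0.61 although it recovers no structure, while its AMI is zero up to floating-point error; the merged clustering, which does recover part of the planted structure, keeps a clearly positive AMI.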

After a broad study on factors affecting community detection, Yang et al. (2016) proposed a method for choosing a community detection algorithm. Based on runtime and NMI, the authors recommended certain community detection algorithms for particular combinations of $N$ and $\mu$ . If we followed their method precisely, then the leximin method would be recommended for small graphs with weak communities. With the truth quantified by AMI, we see that leximin should only be used for networks with strong communities, and its runtime limits its application to small to medium-sized graphs.

XI Related Work

Over the years, numerous evaluation methods for community detection have been suggested. While NMI has historically been the most popular (first proposed for community detection by Danon et al. (2005)), others include the variation of information (Meilă, 2003), the V-measure (Rosenberg and Hirschberg, 2007) (a rediscovery of NMI), the adjusted Rand index (ARI) (Hubert and Arabie, 1985), F-score (Dhillon et al., 2003), and Cohen’s $\kappa$ (Liu et al., 2018).

The standardized mutual information (Romano et al., 2014) was proposed as another adjustment for chance. Because it instead uses the variance as the denominator, it no longer has the constant top score we seek. Romano et al. (2016) identified both AMI and ARI as specializations of a general function. Finally, Decelle et al. (2011) proposed a measure called overlap that is adjusted for chance. We favor AMI because it is most similar to the conventional measure and also meets our desiderata.

XII Conclusion and Future Work

We see that the leximin method, like many, is successful on strong communities and is incorrectly appraised by NMI. While NMI was the best known measure when Danon et al. (2005) first used it for CD, we see that NMI can exaggerate community detection performance. Inertia in the CD community is likely why a departure from NMI has not occurred. Nevertheless, the measure’s flaws demand a move to a more robust measure. The choice of AMI corrects this exaggeration, so we encourage AMI’s use in the CD community moving forward.

The curious reader may note that mutual information is already a symmetric measure. An asymmetric measure like relative entropy makes a promising avenue for future work. Another important avenue is finding fair and appropriate measures for hierarchical or overlapping clustering (Horta and Campello, 2015). Finally, future work will assess CD methods and the related $k$ -way partition problem using one-sided AMI under $\mathbb{M}_{\mathrm{all}}$ and $\mathbb{M}_{\mathrm{num}}$ respectively, as well as varying the generalized mean parameter $p$ .

Appendix A Self-specialization of AMI

We show that the one-sided AMI and two-sided AMI are identical under $\mathbb{M}_{\text{perm}}$ ; that is, AMI specializes itself. Superficially, the two equations differ in the expectation and the max-term, which are taken over different distributions. By showing that each of these is in fact identical between the equations, we will show that AMI self-specializes. For this, we will discuss clusterings $\mathcal{C}$ and $\mathcal{T}$ on set $X$ .

The upper bound function is straightforward; it is some generalized mean $M_{p}$ of the two clusterings’ entropies $H(\mathcal{C})$ and $H(\mathcal{T})$ . This bound is unchanged whether using one-sided or two-sided AMI, because the entropy of a clustering depends only on its decomposition pattern—not the actual elements. So much for the upper bound function.
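A small numerical check of this argument, offered as a sketch under stated assumptions: the helper names and label vectors are hypothetical, and the limiting cases of $M_{p}$ ( $p\to-\infty$ , $p=0$ , $p\to\infty$ ) are handled explicitly as the minimum, geometric mean, and maximum. Entropies are assumed positive so that negative $p$ is well defined.

```python
from collections import Counter
from math import inf, log


def entropy(labels):
    n = len(labels)
    return -sum(c / n * log(c / n) for c in Counter(labels).values())


def generalized_mean(x, y, p):
    """Generalized (power) mean of two positive numbers."""
    if p == -inf:
        return min(x, y)
    if p == inf:
        return max(x, y)
    if p == 0:
        return (x * y) ** 0.5  # limit as p -> 0 is the geometric mean
    return ((x ** p + y ** p) / 2) ** (1 / p)


C = [0, 0, 1, 1, 2, 2]
C_shuffled = [2, 0, 1, 0, 2, 1]   # same shape (2, 2, 2), different elements
T = [0, 0, 0, 1, 1, 1]

h_c, h_c_shuffled, h_t = entropy(C), entropy(C_shuffled), entropy(T)

# The bound for a few values of p; it never depends on which nodes sit where.
bounds = [generalized_mean(h_c, h_t, p) for p in (-inf, 0, 1, inf)]
print(h_c, h_c_shuffled, bounds)
```

Because `entropy` only looks at cluster sizes, shuffling the nodes leaves it unchanged, and so every choice of $p$ yields the same bound before and after the shuffle.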

Now, we must show that the expectation is the same regardless of whether one or both variables is bound. (The reader may recall Figure 1 if a concrete version of the problem is helpful.) Formally,

[TABLE]

given $\mathcal{C}^{\prime}\in\mathbb{M}_{\text{perm}}(\mathcal{C})$ and $\mathcal{T}^{\prime}\in\mathbb{M}_{\text{perm}}(\mathcal{T})$ . We note that either variable may be fixed due to the symmetry of mutual information, and the claim will still hold.

We will first show that the expectation is impervious to permutations of the input, then we will use this resilience to show that the expectation is the same for one- or two-sided expectations.

It is clear from the definition of mutual information (Equation 6) that it is permutation-invariant: it relies on intersections and sizes of sets, and in aggregate these are unaltered by permuting the labels.

Now, we rely on the following theorem:

Theorem 1 (Schumacher et al. (2001)).

For any set $X$ , given partitions $\mathcal{T}_{1}$ and $\mathcal{T}_{2}$ on $X$ with the same shape, another partition $\mathcal{C}$ on $X$ (not necessarily the same shape as $\mathcal{T}_{1}$ ), and any permutation-invariant function $F(\cdot,\cdot)$ on pairs of partitions, we have

[TABLE]

This amounts to saying that the expectation is independent of the particular fixed variable ( $\mathcal{C}$ or $\mathcal{T}$ ) so long as the fixed variable is drawn from $\mathbb{M}_{\text{perm}}$ . That is:

[TABLE]

This value is a row average in Figure 1; more broadly, it is a one-sided expectation for the randomness model $\mathbb{M}_{\text{perm}}$ . What remains is to show that this average matches the two-sided average, which relies on simple algebra.

[TABLE]

Thus, the two-sided expectation matches the one-sided expectation. ∎
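The equality can also be verified by brute force on a small instance. The sketch below is an illustration only: it enumerates every permutation of five nodes as a stand-in for $\mathbb{M}_{\text{perm}}$ , and the two label vectors and helper names are hypothetical.

```python
from collections import Counter
from itertools import permutations
from math import log


def mutual_information(a, b):
    n = len(a)
    size_a, size_b = Counter(a), Counter(b)
    return sum(
        nij / n * log(n * nij / (size_a[i] * size_b[j]))
        for (i, j), nij in Counter(zip(a, b)).items()
    )


def permuted(labels, order):
    """Reassign cluster labels to nodes according to a permutation of the nodes."""
    return [labels[i] for i in order]


C = [0, 0, 1, 1, 2]
T = [0, 0, 0, 1, 1]
orders = list(permutations(range(len(C))))

# One-sided: T stays fixed, only C is permuted.
one_sided = sum(mutual_information(permuted(C, s), T) for s in orders) / len(orders)

# One-sided with the roles swapped: C stays fixed, only T is permuted.
one_sided_swapped = sum(
    mutual_information(C, permuted(T, t)) for t in orders
) / len(orders)

# Two-sided: both clusterings are permuted independently.
two_sided = sum(
    mutual_information(permuted(C, s), permuted(T, t))
    for s in orders
    for t in orders
) / len(orders) ** 2

print(one_sided, one_sided_swapped, two_sided)
```

All three averages agree up to floating-point error, which is the self-specialization claim restricted to this one pair of shapes.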

Bibliography

[1] D. W. Matula, in Classification and clustering, edited by J. van Ryzin (Elsevier, 1977) pp. 95–129.
[2] D. Lai and C. Nardini, Journal of Statistical Mechanics: Theory and Experiment 2016, 093403 (2016).
[3] L. Peel, D. B. Larremore, and A. Clauset, Science Advances 3 (2017), 10.1126/sciadv.1602548.
[4] N. X. Vinh, J. Epps, and J. Bailey, Journal of Machine Learning Research 11, 2837 (2010).
[5] J. Zhang, T. Chen, and J. Hu, Journal of Statistical Mechanics: Theory and Experiment 2015, P03009 (2015).
[6] P. Zhang, Journal of Statistical Mechanics: Theory and Experiment 2015, P11006 (2015).
[7] A. D. McCarthy, Gridlock in Networks: The Leximin Method for Hierarchical Community Detection, Master’s thesis, Southern Methodist University (2017).
[8] A. J. Gates and Y.-Y. Ahn, Journal of Machine Learning Research 18, 1 (2017).
