Adjusting for Chance Clustering Comparison Measures
Simone Romano, Nguyen Xuan Vinh, James Bailey, Karin Verspoor

TL;DR
This paper develops a unified framework for adjusting clustering comparison measures based on information theory and pair-counting, providing guidelines for their appropriate application based on cluster size balance.
Contribution
It analytically computes expected values and variances of generalized information-theoretic measures, bridging pair-counting and Shannon-based adjustments, and offers practical usage guidelines.
Findings
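Adjusted generalized IT measures reduce to known adjusted measures, such as ARI and AMI, in special cases
Guidelines for choosing between ARI and AMI based on cluster size balance: ARI when the reference clustering has large, equal-sized clusters; AMI when it is unbalanced and contains small clusters
Analytical formulas for the expected value and variance of generalized IT measures allow these measures to be adjusted for chance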
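To make "adjusted for chance" concrete, here is a minimal sketch of the standard pair-counting ARI, computed as (index - expected index) / (maximum index - expected index) from the contingency table of two labelings. It illustrates the well-known special case only, not the paper's generalized adjustment; the function name `adjusted_rand_index` and the handling of degenerate inputs are assumptions made for this example.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_a, labels_b):
    """Standard ARI between two labelings of the same items (illustrative sketch)."""
    if len(labels_a) != len(labels_b):
        raise ValueError("both labelings must cover the same items")

    total_pairs = comb(len(labels_a), 2)
    if total_pairs == 0:
        return 1.0  # assumption: fewer than two items counts as perfect agreement

    # Contingency table cells and its row/column marginals (cluster sizes).
    cells = Counter(zip(labels_a, labels_b))
    rows = Counter(labels_a)
    cols = Counter(labels_b)

    # Pairs of items placed together in both clusterings, in A, and in B.
    together_both = sum(comb(c, 2) for c in cells.values())
    together_a = sum(comb(c, 2) for c in rows.values())
    together_b = sum(comb(c, 2) for c in cols.values())

    # Expected agreement when cluster sizes are fixed and items are permuted.
    expected = together_a * together_b / total_pairs
    maximum = 0.5 * (together_a + together_b)

    if maximum == expected:
        return 1.0  # assumption: degenerate case, e.g. one cluster in both labelings
    return (together_both - expected) / (maximum - expected)


if __name__ == "__main__":
    # Same partition under different label names, then a crossed partition.
    print(adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0]))
    print(adjusted_rand_index([0, 0, 1, 1], [0, 1, 0, 1]))
```

Subtracting the expected value is what keeps the score near zero for unrelated clusterings, regardless of the number of clusters.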
Abstract
Adjusted-for-chance measures are widely used to compare partitions/clusterings of the same data set. In particular, the Adjusted Rand Index (ARI) based on pair-counting, and the Adjusted Mutual Information (AMI) based on Shannon information theory are very popular in the clustering community. Nonetheless, it is an open problem as to what are the best application scenarios for each measure, and guidelines in the literature for their usage are sparse, with the result that users often resort to using both. Generalized Information Theoretic (IT) measures based on the Tsallis entropy have been shown to link pair-counting and Shannon IT measures. In this paper, we aim to bridge the gap between adjustment of measures based on pair-counting and measures based on information theory. We solve the key technical challenge of analytically computing the expected value and variance of generalized IT…
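The link between pair-counting and Shannon measures mentioned in the abstract can be illustrated with the Tsallis entropy itself. The sketch below assumes the usual definition H_q(p) = (1 - sum_i p_i^q) / (q - 1), with natural logarithms in the q → 1 limit; the function names `tsallis_entropy` and `cluster_proportions` are hypothetical helpers for this example and are not taken from the paper.

```python
import math


def cluster_proportions(sizes):
    """Turn cluster sizes into the probability vector p_i = n_i / N."""
    total = sum(sizes)
    return [s / total for s in sizes]


def tsallis_entropy(probs, q):
    """Tsallis entropy of order q; falls back to Shannon entropy (nats) at q = 1."""
    if abs(q - 1.0) < 1e-12:
        return -sum(p * math.log(p) for p in probs if p > 0)
    return (1.0 - sum(p ** q for p in probs if p > 0)) / (q - 1.0)


if __name__ == "__main__":
    p = cluster_proportions([50, 30, 20])
    # q = 2 gives 1 - sum(p_i^2): the chance that two items drawn
    # independently (with replacement) fall in different clusters.
    print(tsallis_entropy(p, 2.0))
    # As q approaches 1 the value approaches the Shannon entropy.
    print(tsallis_entropy(p, 1.000001), tsallis_entropy(p, 1.0))
```

At q = 2 the entropy depends only on how pairs of items are split across clusters, which is the pair-counting view; near q = 1 it recovers the Shannon quantity underlying mutual information.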
Taxonomy
Topics: Statistical Mechanics and Entropy · Advanced Clustering Algorithms Research · Complex Systems and Time Series Analysis
