Sanity Check for External Clustering Validation Benchmarks using Internal Validation Measures
Hyeon Jeon, Michael Aupetit, DongHwa Shin, Aeri Cho, Seokhyeon Park, Jinwook Seo

TL;DR
This paper introduces a new method to evaluate the alignment between class labels and true clusters across datasets, improving the reliability of external clustering validation benchmarks.
Contribution
It proposes a novel framework for extending internal validation measures to compare cluster-label matching across datasets, addressing a key gap in external validation.
Findings
The generalized Calinski-Harabasz index effectively evaluates cluster-label matching (CLM) across datasets.
The proposed measures satisfy four new axioms for between-dataset internal validation.
Evaluating CLM before external validation improves benchmarking reliability.
Abstract
We address the lack of reliability in benchmarking clustering techniques based on labeled datasets. A standard scheme in external clustering validation is to use class labels as ground truth clusters, based on the assumption that each class forms a single, clearly separated cluster. However, because this cluster-label matching (CLM) assumption often breaks, the absence of a sanity check on the CLM of benchmark datasets casts doubt on the validity of external validations. Still, evaluating the degree of CLM is challenging. For example, internal clustering validation measures can quantify CLM within a single dataset by comparing its different clusterings, but they are not designed to compare clusterings of different datasets. In this work, we propose a principled way to generate between-dataset internal measures that enable the comparison of CLM across datasets. We first determine…
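To make the idea of scoring CLM with an internal measure concrete, here is a minimal sketch that computes the standard (within-dataset) Calinski-Harabasz index while treating class labels as the cluster assignment. It is not the generalized between-dataset index proposed in the paper. The function name `calinski_harabasz` and the toy data are illustrative assumptions.

```python
from collections import defaultdict


def calinski_harabasz(points, labels):
    """Standard Calinski-Harabasz index, using class labels as clusters.

    Illustrative helper (not from the paper): higher values mean the
    classes are more compact and better separated within this dataset.
    """
    n = len(points)
    dim = len(points[0])

    # Group the points by their class label.
    groups = defaultdict(list)
    for p, y in zip(points, labels):
        groups[y].append(p)
    k = len(groups)
    if k < 2 or k >= n:
        raise ValueError("need 2 <= number of classes < number of points")

    def mean(rows):
        return [sum(r[d] for r in rows) / len(rows) for d in range(dim)]

    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    overall = mean(points)
    between = 0.0  # size-weighted squared distance of class centroids to the global centroid
    within = 0.0   # squared distance of points to their own class centroid
    for rows in groups.values():
        centroid = mean(rows)
        between += len(rows) * sq_dist(centroid, overall)
        within += sum(sq_dist(r, centroid) for r in rows)

    if within == 0.0:
        return float("inf")
    return (between / (k - 1)) / (within / (n - k))


# Toy data: two tight, well-separated groups of points.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
matching_labels = [0, 0, 0, 1, 1, 1]    # labels follow the clusters
mismatched_labels = [0, 1, 0, 1, 0, 1]  # labels cut across the clusters

ch_good = calinski_harabasz(points, matching_labels)
ch_bad = calinski_harabasz(points, mismatched_labels)
print(ch_good, ch_bad)
```

On the same points, the labeling that follows the clusters scores far higher than the one that cuts across them. As the abstract notes, raw scores of this kind are meant for comparing labelings of one dataset, not for ranking different datasets against each other.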
Taxonomy
Topics: Advanced Clustering Algorithms Research · Text and Document Classification Technologies · Bayesian Methods and Mixture Models
