Sanity Check for External Clustering Validation Benchmarks using Internal Validation Measures

Hyeon Jeon; Michael Aupetit; DongHwa Shin; Aeri Cho; Seokhyeon Park; Jinwook Seo

arXiv:2209.10042·cs.LG·September 22, 2022·6 cites

TL;DR

This paper introduces a new method to evaluate the alignment between class labels and true clusters across datasets, improving the reliability of external clustering validation benchmarks.

Contribution

It proposes a novel framework for extending internal validation measures to compare cluster-label matching across datasets, addressing a key gap in external validation.

Findings

01

The generalized Calinski-Harabasz index effectively evaluates cluster-label matching (CLM) across datasets.

02

The proposed measures satisfy four new axioms for between-dataset internal validation.

03

Evaluating CLM before external validation improves benchmarking reliability.

Abstract

We address the lack of reliability in benchmarking clustering techniques based on labeled datasets. A standard scheme in external clustering validation is to use class labels as ground-truth clusters, based on the assumption that each class forms a single, clearly separated cluster. However, this cluster-label matching (CLM) assumption often breaks, and the absence of a sanity check for the CLM of benchmark datasets casts doubt on the validity of external validations. Still, evaluating the degree of CLM is challenging. For example, internal clustering validation measures can quantify CLM within a single dataset to evaluate its different clusterings, but they are not designed to compare clusterings of different datasets. In this work, we propose a principled way to generate between-dataset internal measures that enable the comparison of CLM across datasets. We first determine…
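
To make the within-dataset use of an internal measure concrete, the sketch below computes the standard Calinski-Harabasz index while treating class labels as the clustering. This is the textbook within-dataset index, not the generalized between-dataset variant the paper proposes; the function name `calinski_harabasz` and the toy data are illustrative assumptions, not taken from the paper or its repository.

```python
from collections import defaultdict


def calinski_harabasz(points, labels):
    """Standard Calinski-Harabasz index, using class labels as clusters.

    Assumption: this is the classic within-dataset index,
    (between-class dispersion / (k - 1)) / (within-class dispersion / (n - k)),
    not the paper's generalized between-dataset measure.
    """
    n = len(points)
    dim = len(points[0])

    # Group points by class label.
    groups = defaultdict(list)
    for point, label in zip(points, labels):
        groups[label].append(point)
    k = len(groups)
    if k < 2 or n <= k:
        raise ValueError("need at least 2 classes and more points than classes")

    def centroid(rows):
        return [sum(r[d] for r in rows) / len(rows) for d in range(dim)]

    def sqdist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    overall = centroid(points)
    between = 0.0  # size-weighted squared distance of class centroids to the global centroid
    within = 0.0   # squared distance of each point to its own class centroid
    for rows in groups.values():
        c = centroid(rows)
        between += len(rows) * sqdist(c, overall)
        within += sum(sqdist(r, c) for r in rows)

    if within == 0.0:
        return float("inf")  # every class collapses to a single point
    return (between / (k - 1)) / (within / (n - k))


# Hypothetical toy data: two compact, well-separated blobs.
points = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10), (10, 11), (11, 10), (11, 11)]
matching_labels = [0, 0, 0, 0, 1, 1, 1, 1]  # labels agree with the blobs
mixed_labels = [0, 1, 0, 1, 0, 1, 0, 1]     # labels cut across the blobs

good_score = calinski_harabasz(points, matching_labels)
bad_score = calinski_harabasz(points, mixed_labels)
print(good_score, bad_score)
```

On this toy data the labeling that follows the blobs scores far higher than the one that cuts across them, which is the within-dataset comparison the abstract describes. Scores computed this way depend on dataset properties such as size and dimensionality, so they are not directly comparable across datasets; the abstract identifies that limitation as the motivation for between-dataset measures.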

Code & Models

Repositories

hj-n/labeled-datasets (official)

Taxonomy

Topics: Advanced Clustering Algorithms Research · Text and Document Classification Technologies · Bayesian Methods and Mixture Models