Clusterability test for categorical data
Lianyu Hu, Junjie Dong, Mudi Jiang, Yan Liu, Zengyou He

TL;DR
This paper introduces TestCat, a statistically rigorous method for evaluating the clusterability of categorical data by analyzing attribute associations, addressing a previously overlooked problem in cluster analysis.
Contribution
The paper presents the first effective statistical test for categorical data clusterability, using chi-squared statistics to assess attribute associations.
Findings
TestCat outperforms existing numeric data clusterability methods on benchmark datasets.
It provides an analytical p-value to determine clusterability of categorical data.
The approach is the first to reliably evaluate categorical data clusterability.
Abstract
The objective of clusterability evaluation is to check whether a clustering structure exists within the data set. As a crucial yet often-overlooked issue in cluster analysis, it is essential to conduct such a test before applying any clustering algorithm. If a data set is unclusterable, any subsequent clustering analysis would not yield valid results. Despite its importance, the majority of existing studies focus on numerical data, leaving the clusterability evaluation issue for categorical data as an open problem. Here we present TestCat, a testing-based approach to assess the clusterability of categorical data in terms of an analytical p-value. The key idea underlying TestCat is that clusterable categorical data possess many strongly associated attribute pairs and hence the sum of chi-squared statistics of all attribute pairs is employed as the test statistic for p-value…
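The test statistic described in the abstract, the sum of chi-squared statistics over all attribute pairs, can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `chi2_statistic` and `pairwise_chi2_sum` and the toy data are hypothetical, the statistic is the plain Pearson chi-squared with no continuity correction, and the analytical p-value step is omitted because the truncated abstract does not specify it.

```python
from collections import Counter
from itertools import combinations


def chi2_statistic(x, y):
    """Pearson chi-squared statistic for two equal-length categorical columns."""
    n = len(x)
    joint = Counter(zip(x, y))  # observed counts per (value of x, value of y)
    row = Counter(x)            # marginal counts of x
    col = Counter(y)            # marginal counts of y
    stat = 0.0
    for a, ra in row.items():
        for b, cb in col.items():
            expected = ra * cb / n  # expected count under independence
            observed = joint.get((a, b), 0)
            stat += (observed - expected) ** 2 / expected
    return stat


def pairwise_chi2_sum(rows):
    """Sum the chi-squared statistic over every pair of attributes (columns)."""
    columns = list(zip(*rows))
    return sum(
        chi2_statistic(columns[i], columns[j])
        for i, j in combinations(range(len(columns)), 2)
    )


# Toy data (made up): two tight groups vs. a balanced full-factorial design.
clustered = [("a", "x", "p")] * 10 + [("b", "y", "q")] * 10
independent = [(a, x, p) for a in "ab" for x in "xy" for p in "pq"] * 3

print(pairwise_chi2_sum(clustered))    # large: every attribute pair is associated
print(pairwise_chi2_sum(independent))  # zero: attributes are mutually independent
```

A larger sum indicates more strongly associated attribute pairs, which the abstract treats as evidence of clusterability. Turning the sum into a p-value requires the null distribution derived in the paper.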
Taxonomy
Topics: Advanced Clustering Algorithms Research · Bayesian Methods and Mixture Models · Sensory Analysis and Statistical Methods
