Powerful Significance Testing for Unbalanced Clusters
Thomas H. Keefe, J.S. Marron

TL;DR
This paper introduces a significance test for clustering that remains powerful when candidate clusters have unbalanced sizes, a setting in which SigClust is severely underpowered, and illustrates it on kidney cancer gene expression data.
Contribution
The paper proposes a generalized k-means based significance test that improves power for unbalanced clusters, a common challenge in high-dimensional data analysis.
Findings
The new method is more powerful than SigClust when candidate clusters have unbalanced sizes, and it remains powerful in the balanced setting.
The method is illustrated on a high-dimensional dataset of gene expression in kidney cancer patients.
A Python implementation is available at https://github.com/thomaskeefe/sigclust.
Abstract
Clustering methods are popular for revealing structure in data, particularly in the high-dimensional setting common to contemporary data science. A central statistical question is, "are the clusters really there?" One pioneering method in statistical cluster validation is SigClust, but it is severely underpowered in the important setting where the candidate clusters have unbalanced sizes, such as in rare subtypes of disease. We show why this is the case, and propose a remedy that is powerful in both the unbalanced and balanced settings, using a novel generalization of k-means clustering. We illustrate the value of our method using a high-dimensional dataset of gene expression in kidney cancer patients. A Python implementation is available at https://github.com/thomaskeefe/sigclust.
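To make the question "are the clusters really there?" concrete, here is a minimal sketch of a simulation-based cluster significance test in the general spirit of SigClust. It is not the authors' proposed method and not the API of the linked repository. The function names (cluster_index, simulated_p_value) are hypothetical, and the null model is simplified to a Gaussian with independent coordinates whose variances are taken from the data, which is an assumption made only for this illustration. The statistic is the 2-means cluster index (within-cluster sum of squares divided by total sum of squares); a small value relative to the simulated null suggests real cluster structure.

```python
import random

def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length sequences."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def centroid(points):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(points)
    return [sum(col) / n for col in zip(*points)]

def cluster_index(points, n_restarts=3, n_iter=50, seed=0):
    """2-means cluster index: within-cluster SS / total SS (smaller = stronger split)."""
    rng = random.Random(seed)
    center = centroid(points)
    total_ss = sum(sq_dist(p, center) for p in points)
    best_within = total_ss
    for _ in range(n_restarts):
        # Start Lloyd's algorithm from two randomly chosen data points.
        centers = [list(p) for p in rng.sample(points, 2)]
        for _ in range(n_iter):
            groups = ([], [])
            for p in points:
                nearest = 0 if sq_dist(p, centers[0]) <= sq_dist(p, centers[1]) else 1
                groups[nearest].append(p)
            if not groups[0] or not groups[1]:
                break  # degenerate split; keep the current centers
            new_centers = [centroid(g) for g in groups]
            if new_centers == centers:
                break  # assignments stopped changing
            centers = new_centers
        within = sum(min(sq_dist(p, c) for c in centers) for p in points)
        best_within = min(best_within, within)
    return best_within / total_ss

def simulated_p_value(points, n_sim=50, seed=0):
    """Compare the observed cluster index with indices from a single-Gaussian null.

    Assumption for this sketch only: the null has independent coordinates with
    the per-coordinate sample standard deviations of the data.
    """
    rng = random.Random(seed)
    n = len(points)
    center = centroid(points)
    sds = [(sum((x - m) ** 2 for x in col) / (n - 1)) ** 0.5
           for col, m in zip(zip(*points), center)]
    observed = cluster_index(points, seed=seed)
    null_indices = []
    for sim in range(n_sim):
        sample = [[rng.gauss(0.0, sd) for sd in sds] for _ in range(n)]
        null_indices.append(cluster_index(sample, seed=sim))
    # Fraction of null indices at least as small as the observed one.
    hits = sum(ci <= observed for ci in null_indices)
    return observed, (1 + hits) / (1 + n_sim)

# Toy data: two well-separated, equally sized groups in two dimensions.
data_rng = random.Random(1)
group_a = [[data_rng.gauss(0.0, 1.0), data_rng.gauss(0.0, 1.0)] for _ in range(30)]
group_b = [[data_rng.gauss(10.0, 1.0), data_rng.gauss(10.0, 1.0)] for _ in range(30)]
observed_ci, p_value = simulated_p_value(group_a + group_b)
print(observed_ci, p_value)
```

The toy data here are balanced on purpose. According to the abstract, a test of this kind loses power when one candidate cluster is much smaller than the other, which is the setting the paper's generalization of k-means is designed to handle.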
Taxonomy
Topics: Gene expression and cancer classification · Data Mining Algorithms and Applications · Bayesian Methods and Mixture Models
