Powerful Significance Testing for Unbalanced Clusters

Thomas H. Keefe; J.S. Marron

arXiv:2308.13079 · stat.ME · August 28, 2023 · J. Comput. Graph. Stat.

TL;DR

This paper introduces a significance test for clustering that remains powerful when the candidate clusters have unbalanced sizes, a setting in which SigClust is severely underpowered, and illustrates its value on gene expression data from kidney cancer patients.

Contribution

The paper explains why SigClust loses power when candidate clusters are unbalanced and proposes a remedy, based on a novel generalization of k-means clustering, that is powerful in both the unbalanced and balanced settings.

Findings

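01 The new method is more powerful than SigClust when the candidate clusters have unbalanced sizes, and remains powerful in the balanced setting.

02 Its value is illustrated on a high-dimensional dataset of gene expression in kidney cancer patients.

03 A Python implementation is available at https://github.com/thomaskeefe/sigclust.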

Abstract

Clustering methods are popular for revealing structure in data, particularly in the high-dimensional setting common to contemporary data science. A central statistical question is, "are the clusters really there?" One pioneering method in statistical cluster validation is SigClust, but it is severely underpowered in the important setting where the candidate clusters have unbalanced sizes, such as in rare subtypes of disease. We show why this is the case, and propose a remedy that is powerful in both the unbalanced and balanced settings, using a novel generalization of k-means clustering. We illustrate the value of our method using a high-dimensional dataset of gene expression in kidney cancer patients. A Python implementation is available at https://github.com/thomaskeefe/sigclust.
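To make the question "are the clusters really there?" concrete, the following is a minimal sketch of a SigClust-style Monte Carlo test. It is not the method proposed in the paper and not the API of the linked repository. It assumes a simplified null model (a single Gaussian with independent coordinates whose variances are the sample variances), a plain 2-means heuristic, and an empirical p-value; the function names `cluster_index` and `sigclust_style_pvalue` are hypothetical and chosen only for illustration.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two points given as lists."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean(points):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(points)
    return [sum(col) / n for col in zip(*points)]


def cluster_index(points, n_starts=5, n_iter=30, rng=random):
    """Smallest 2-means cluster index found over a few random starts.

    Cluster index = within-cluster sum of squares / total sum of squares.
    Values near 0 indicate two tight, well-separated groups.
    """
    center = mean(points)
    total_ss = sum(sq_dist(p, center) for p in points)
    best = 1.0
    for _ in range(n_starts):
        c0, c1 = rng.sample(points, 2)
        labels = None
        for _ in range(n_iter):
            new_labels = [sq_dist(p, c1) < sq_dist(p, c0) for p in points]
            if new_labels == labels:
                break  # assignments stopped changing
            labels = new_labels
            g0 = [p for p, in_1 in zip(points, labels) if not in_1]
            g1 = [p for p, in_1 in zip(points, labels) if in_1]
            if not g0 or not g1:
                break  # degenerate start, skip it
            c0, c1 = mean(g0), mean(g1)
        if g0 and g1:
            within = (sum(sq_dist(p, c0) for p in g0)
                      + sum(sq_dist(p, c1) for p in g1))
            best = min(best, within / total_ss)
    return best


def sigclust_style_pvalue(points, n_sim=200, seed=0):
    """Compare the observed cluster index with a simulated Gaussian null.

    Assumption (simplification): the null is a Gaussian with independent
    coordinates and the per-coordinate sample standard deviations.
    """
    rng = random.Random(seed)
    n = len(points)
    center = mean(points)
    sds = [(sum((x - m) ** 2 for x in col) / (n - 1)) ** 0.5
           for col, m in zip(zip(*points), center)]
    observed = cluster_index(points, rng=rng)
    null_cis = []
    for _ in range(n_sim):
        sim = [[rng.gauss(0.0, s) for s in sds] for _ in range(n)]
        null_cis.append(cluster_index(sim, rng=rng))
    # Empirical p-value: share of null cluster indices at or below the observed one.
    p_value = sum(ci <= observed for ci in null_cis) / n_sim
    return observed, p_value


# Toy usage: two balanced, well-separated groups in two dimensions.
data_rng = random.Random(1)
data = ([[data_rng.gauss(-4, 1), data_rng.gauss(0, 1)] for _ in range(30)]
        + [[data_rng.gauss(4, 1), data_rng.gauss(0, 1)] for _ in range(30)])
ci, p = sigclust_style_pvalue(data, n_sim=100)
print(f"cluster index = {ci:.3f}, empirical p-value = {p:.3f}")
```

A small p-value says the observed 2-means split is tighter than splits of data drawn from a single Gaussian. The toy data here are balanced; the paper concerns the unbalanced case, where it reports that SigClust is severely underpowered.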

Code & Models

Repositories

thomaskeefe/sigclust (Official)

Taxonomy

Topics: Gene expression and cancer classification · Data Mining Algorithms and Applications · Bayesian Methods and Mixture Models