Statistical power for cluster analysis
E. S. Dalmaijer, C. L. Nord, and D. E. Astle

TL;DR
This study evaluates statistical power in cluster analysis, demonstrating how factors like effect size, sample size, and clustering method influence the ability to detect true subgroups in biomedical data.
Contribution
The paper provides the first comprehensive simulation-based assessment of statistical power for common cluster analysis pipelines, giving researchers guidance on which practices to prefer.
Findings
Large effect sizes enable detection with small samples (N=20 per subgroup).
Fuzzy clustering (c-means) outperforms discrete methods such as k-means when subgroup distributions overlap.
Multidimensional scaling enhances cluster separation and power.
Abstract
Cluster algorithms are increasingly popular in biomedical research due to their compelling ability to identify discrete subgroups in data, and their increasing accessibility in mainstream software. While guidelines exist for algorithm selection and outcome evaluation, there are no firmly established ways of computing a priori statistical power for cluster analysis. Here, we estimated power and accuracy for common analysis pipelines through simulation. We varied subgroup size, number, separation (effect size), and covariance structure. We then subjected generated datasets to dimensionality reduction (none, multidimensional scaling, or UMAP) and cluster algorithms (k-means, agglomerative hierarchical clustering with Ward or average linkage and Euclidean or cosine distance, HDBSCAN). Finally, we compared the statistical power of discrete (k-means), "fuzzy" (c-means), and finite mixture…
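The simulation approach described in the abstract can be illustrated with a minimal sketch. It is not the authors' code: it assumes two equally sized spherical Gaussian subgroups, k-means with k=2 as the only pipeline, and a silhouette score of at least 0.5 as the criterion for "clustering detected". The function names (`simulate_two_subgroups`, `estimate_power`) and every default value are hypothetical choices made for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score


def simulate_two_subgroups(n_per_group, delta, n_features, rng):
    """Draw two unit-variance Gaussian subgroups with centroids `delta` apart.

    Assumption: the separation lies along the first feature only.
    """
    shift = np.zeros(n_features)
    shift[0] = delta
    group_a = rng.standard_normal((n_per_group, n_features))
    group_b = rng.standard_normal((n_per_group, n_features)) + shift
    return np.vstack([group_a, group_b])


def estimate_power(n_per_group, delta, n_features=2, n_sims=100,
                   threshold=0.5, seed=0):
    """Fraction of simulated datasets whose k-means solution is 'detected'.

    Assumption: a dataset counts as detected when the silhouette score
    of the two-cluster k-means solution reaches `threshold`.
    """
    rng = np.random.default_rng(seed)
    detected = 0
    for sim in range(n_sims):
        X = simulate_two_subgroups(n_per_group, delta, n_features, rng)
        labels = KMeans(n_clusters=2, n_init=10,
                        random_state=sim).fit_predict(X)
        if silhouette_score(X, labels) >= threshold:
            detected += 1
    return detected / n_sims


if __name__ == "__main__":
    # Sweep centroid separation at a fixed, small subgroup size.
    for delta in (0.0, 2.0, 4.0, 8.0):
        power = estimate_power(n_per_group=20, delta=delta, n_sims=50)
        print(f"delta={delta:.1f}  estimated power={power:.2f}")
```

Sweeping `n_per_group` as well as `delta` would give a rough power surface for this one pipeline. Other pipelines from the abstract (dimensionality reduction, hierarchical clustering, HDBSCAN, mixture models) would replace the k-means step.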
Taxonomy
Topics: Statistical Methods and Bayesian Inference · Bayesian Methods and Mixture Models
