Selective inference for k-means clustering
Yiqun T. Chen, Daniela M. Witten

TL;DR
This paper introduces a selective inference method for testing for a difference in means between clusters identified by k-means. The resulting p-value controls the selective Type I error in finite samples, and the method is applied to hand-written digits data and single-cell RNA-sequencing data.
Contribution
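It proposes a finite-sample p-value for the difference in means between a pair of clusters obtained using k-means. The p-value accounts for the fact that the clusters were estimated from the same data being tested, so it controls the selective Type I error where classical tests inflate the Type I error rate.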
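For readers unfamiliar with the term, selective Type I error control can be written out as follows. This is a generic statement of the property, not a formula quoted from the paper: the symbols $X$ (the data), $\mu_i$ (the mean of observation $i$), $\hat{C}_1, \hat{C}_2$ (the two estimated clusters), and $p(X)$ (the p-value) are notation assumed here for illustration.

```latex
\begin{align*}
% Null hypothesis: the two estimated clusters have the same average mean
H_0 &:\ \frac{1}{|\hat{C}_1|}\sum_{i \in \hat{C}_1} \mu_i
      \;=\; \frac{1}{|\hat{C}_2|}\sum_{i \in \hat{C}_2} \mu_i \\
% Selective Type I error control: the guarantee holds conditional on the
% event that k-means produced the clusters being compared
\mathbb{P}_{H_0}\!\left( p(X) \le \alpha \;\middle|\;
      \hat{C}_1, \hat{C}_2 \text{ are clusters returned by k-means on } X \right)
    &\le \alpha \qquad \text{for all } \alpha \in [0, 1]
\end{align*}
```

The conditioning is what separates this from the classical guarantee: the hypothesis is only tested because k-means produced these two clusters, so the error rate is evaluated given that event.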
Findings
The method controls the selective Type I error in simulations.
The proposed p-value can be computed efficiently.
The method is demonstrated on hand-written digits data and single-cell RNA-sequencing data.
Abstract
We consider the problem of testing for a difference in means between clusters of observations identified via k-means clustering. In this setting, classical hypothesis tests lead to an inflated Type I error rate. To overcome this problem, we take a selective inference approach. We propose a finite-sample p-value that controls the selective Type I error for a test of the difference in means between a pair of clusters obtained using k-means clustering, and show that it can be efficiently computed. We apply our proposal in simulation, and on hand-written digits data and single-cell RNA-sequencing data.
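The inflation described in the abstract can be seen in a small simulation. The sketch below is not the authors' method or code. It only illustrates the problem they address, under simplifying assumptions chosen here: one-dimensional data, k = 2, known variance, and a hand-written Lloyd's algorithm. The function names (`two_means_1d`, `naive_p_value`, `naive_rejection_rate`) are hypothetical. Every dataset is drawn from a single Gaussian, so there is no true difference in means, yet a z-test applied to the k-means clusters rejects far more often than the nominal level.

```python
import math
import random


def two_means_1d(x, n_iter=100):
    """Lloyd's algorithm with k = 2 on 1-D data; returns a 0/1 label per point.

    Assumes x contains at least two distinct values.
    """
    c0, c1 = min(x), max(x)  # illustrative initialization at the two extremes
    labels = None
    for _ in range(n_iter):
        new_labels = [0 if abs(v - c0) <= abs(v - c1) else 1 for v in x]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        g0 = [v for v, lab in zip(x, labels) if lab == 0]
        g1 = [v for v, lab in zip(x, labels) if lab == 1]
        c0, c1 = sum(g0) / len(g0), sum(g1) / len(g1)
    return labels


def naive_p_value(x, labels, sigma=1.0):
    """Two-sided z-test for equal cluster means, ignoring how labels were chosen."""
    g0 = [v for v, lab in zip(x, labels) if lab == 0]
    g1 = [v for v, lab in zip(x, labels) if lab == 1]
    diff = sum(g1) / len(g1) - sum(g0) / len(g0)
    se = sigma * math.sqrt(1.0 / len(g0) + 1.0 / len(g1))
    # erfc(|z| / sqrt(2)) equals the two-sided standard normal tail probability
    return math.erfc(abs(diff / se) / math.sqrt(2.0))


def naive_rejection_rate(n_sims=200, n=30, alpha=0.05, seed=0):
    """Fraction of null datasets on which the naive test rejects at level alpha."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(n_sims):
        x = [rng.gauss(0.0, 1.0) for _ in range(n)]  # one Gaussian: no true clusters
        labels = two_means_1d(x)
        if naive_p_value(x, labels) <= alpha:
            rejections += 1
    return rejections / n_sims


rate = naive_rejection_rate()
print(f"naive rejection rate under the null: {rate:.2f} (nominal level 0.05)")
```

A valid test would reject on roughly 5% of these null datasets. The naive test rejects far more often because k-means places the cluster boundary where the two sample means are far apart, and the test then treats those clusters as if they had been fixed in advance.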
Taxonomy
Topics: Single-cell and spatial transcriptomics · Gene expression and cancer classification · SARS-CoV-2 detection and testing
