Selective inference for k-means clustering

Yiqun T. Chen; Daniela M. Witten

arXiv:2203.15267 · stat.ME · March 30, 2022 · J. Mach. Learn. Res. · 23 cites

TL;DR

This paper introduces a selective inference method for testing for a difference in means between a pair of clusters identified by k-means. The resulting p-value controls the selective Type I error in finite samples, and the method is applied to handwritten digits data and single-cell RNA-sequencing data.

Contribution

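It proposes a finite-sample p-value for the difference in means between a pair of clusters obtained via k-means clustering. The p-value accounts for the fact that the clusters were estimated from the same data being tested, a step that classical hypothesis tests ignore and that inflates their Type I error rate. The paper also shows that this p-value can be computed efficiently.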

Findings

01

The proposed p-value controls the selective Type I error in simulations.

02

The p-value can be computed efficiently.

03

The method is demonstrated on handwritten digits data and single-cell RNA-sequencing data.

Abstract

We consider the problem of testing for a difference in means between clusters of observations identified via k-means clustering. In this setting, classical hypothesis tests lead to an inflated Type I error rate. To overcome this problem, we take a selective inference approach. We propose a finite-sample p-value that controls the selective Type I error for a test of the difference in means between a pair of clusters obtained using k-means clustering, and show that it can be efficiently computed. We apply our proposal in simulation, and on hand-written digits data and single-cell RNA-sequencing data.
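The inflation described in the abstract can be reproduced with a small simulation. The sketch below is not the authors' method and does not compute their selective p-value; it only illustrates the problem that motivates it. All names and settings (`two_means`, `naive_pvalue`, `rejection_rates`, 30 observations, 2 features, known variance 1) are assumptions made for this illustration. Data are drawn from a single Gaussian, so there are no true clusters, and a naive Wald-type test for a difference in means is applied twice: once to the two clusters found by k-means, and once to a split fixed before seeing the data.

```python
import numpy as np

def two_means(X, rng, n_iter=50):
    """Plain Lloyd's algorithm with K = 2; returns a 0/1 label per row."""
    centers = X[rng.choice(len(X), size=2, replace=False)]
    labels = None
    for _ in range(n_iter):
        # Squared Euclidean distance from every point to both centers.
        dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        new_labels = dists.argmin(axis=1)
        if labels is not None and np.array_equal(new_labels, labels):
            break  # assignments stopped changing
        labels = new_labels
        # Recompute each center; keep the old one if a cluster is empty.
        centers = np.array([
            X[labels == k].mean(axis=0) if (labels == k).any() else centers[k]
            for k in (0, 1)
        ])
    return labels

def naive_pvalue(X, labels, sigma=1.0):
    """Naive test that treats the two groups as fixed in advance.

    With 2 features and known sigma, the scaled squared difference in
    group means is chi-squared with 2 degrees of freedom under the null,
    whose survival function is exp(-x / 2).
    """
    n1, n2 = (labels == 0).sum(), (labels == 1).sum()
    diff = X[labels == 0].mean(axis=0) - X[labels == 1].mean(axis=0)
    stat = diff @ diff / (sigma ** 2 * (1.0 / n1 + 1.0 / n2))
    return np.exp(-stat / 2.0)

def rejection_rates(n_sims=500, n=30, alpha=0.05, seed=0):
    """Fraction of null datasets on which each naive test rejects."""
    rng = np.random.default_rng(seed)
    fixed_labels = np.arange(n) % 2  # split chosen without looking at X
    km_rejects = fixed_rejects = 0
    for _ in range(n_sims):
        X = rng.standard_normal((n, 2))  # one Gaussian: no true clusters
        km_rejects += naive_pvalue(X, two_means(X, rng)) < alpha
        fixed_rejects += naive_pvalue(X, fixed_labels) < alpha
    return km_rejects / n_sims, fixed_rejects / n_sims

if __name__ == "__main__":
    km_rate, fixed_rate = rejection_rates()
    print(f"rejection rate, k-means clusters: {km_rate:.3f}")
    print(f"rejection rate, fixed split:      {fixed_rate:.3f}")
```

Under these assumptions the fixed split should reject at roughly the nominal 5% level, while the test on k-means clusters should reject far more often, because k-means chooses the groups precisely so that their means differ. A selective p-value of the kind proposed in the paper is meant to correct for that choice.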

Taxonomy

Topics: Single-cell and spatial transcriptomics · Gene expression and cancer classification · SARS-CoV-2 detection and testing