A Graph-based Approach to Estimating the Number of Clusters in High-dimensional Settings
Yichuan Bai, Lynna Chu

TL;DR
This paper introduces a graph-based, non-parametric method for accurately estimating the number of clusters in high-dimensional datasets, demonstrating superior performance over existing methods through simulations and real data applications.
Contribution
The paper presents a novel graph-based statistic for estimating the number of clusters that applies to data of any dimension, is computationally efficient to obtain, and is shown to be selection-consistent asymptotically.
Findings
Outperforms existing methods in high-dimensional simulations
Effective on imaging and RNA-seq datasets
Provides asymptotic consistency proof
Abstract
We consider the problem of estimating the number of clusters (k) in a dataset. We propose a non-parametric approach to the problem that utilizes similarity graphs to construct a robust statistic that effectively captures similarity information among observations. This graph-based statistic is applicable to datasets of any dimension, is computationally efficient to obtain, and can be paired with any kind of clustering technique. Asymptotic theory is developed to establish the selection consistency of the proposed approach. Simulation studies demonstrate that the graph-based statistic outperforms existing methods for estimating k, especially in the high-dimensional setting. We illustrate its utility on an imaging dataset and an RNA-seq dataset.
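The abstract does not spell out the statistic itself, so the following is only a minimal sketch of the general idea it describes: build a similarity graph on the observations, then summarize how well a candidate clustering agrees with that graph. The k-nearest-neighbor graph, the within-cluster edge fraction, and the names `knn_edges` and `within_cluster_edge_fraction` are all illustrative assumptions, not the authors' method. Because the labels can come from any clustering technique, the sketch also shows why such a graph-based summary is independent of the clustering method used.

```python
import math

def knn_edges(points, n_neighbors):
    """Undirected edge set of a brute-force k-nearest-neighbor graph.

    Hypothetical helper: the choice of a k-NN graph is an assumption,
    not something stated in the abstract.
    """
    edges = set()
    for i, p in enumerate(points):
        # Rank all other points by Euclidean distance to point i.
        others = sorted(
            (j for j in range(len(points)) if j != i),
            key=lambda j: math.dist(p, points[j]),
        )
        for j in others[:n_neighbors]:
            edges.add((min(i, j), max(i, j)))
    return edges

def within_cluster_edge_fraction(points, labels, n_neighbors=2):
    """Fraction of graph edges whose endpoints share a cluster label.

    Illustrative stand-in for a graph-based statistic; it is NOT the
    statistic proposed in the paper.
    """
    edges = knn_edges(points, n_neighbors)
    within = sum(1 for i, j in edges if labels[i] == labels[j])
    return within / len(edges)

# Two well-separated groups of three points each (works in any dimension).
points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1),
          (5.0, 5.0), (5.1, 5.0), (5.0, 5.1)]

good_labels = [0, 0, 0, 1, 1, 1]  # matches the two groups
bad_labels = [0, 1, 0, 1, 0, 1]   # splits each group arbitrarily

frac_good = within_cluster_edge_fraction(points, good_labels)
frac_bad = within_cluster_edge_fraction(points, bad_labels)
```

In this toy example the labeling that matches the two groups keeps every graph edge inside a cluster, while the arbitrary labeling cuts most of them. A raw within-cluster fraction like this cannot select k on its own, since it favors fewer clusters; turning it into a consistent estimator is the part the paper's theory addresses.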
Taxonomy
Topics: Complex Network Analysis Techniques · Graph Labeling and Dimension Problems · Graph theory and applications
