Clustering High-dimensional Data via Feature Selection
Tianqi Liu, Yu Lu, Biqing Zhu, and Hongyu Zhao

TL;DR
This paper introduces SC-FS (Spectral Clustering with Feature Selection), a clustering method for high-dimensional data that identifies the informative features with high probability and attains the minimax optimal clustering error rate under a sparse Gaussian mixture model. Its usefulness is demonstrated on four real-world data sets.
Contribution
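The paper proposes Spectral Clustering with Feature Selection (SC-FS), a procedure that obtains initial labels by spectral clustering, keeps the features with the largest R-squared against those labels, and clusters again on the selected features. Under mild conditions it identifies all informative features with high probability and achieves the minimax optimal clustering error rate for the sparse Gaussian mixture model.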
Findings
Identifies all informative features with high probability under mild conditions.
Achieves the minimax optimal clustering error rate for the sparse Gaussian mixture model.
Demonstrates effectiveness on four real-world high-dimensional datasets.
Abstract
High-dimensional clustering analysis is a challenging problem in statistics and machine learning, with broad applications such as the analysis of microarray data and RNA-seq data. In this paper, we propose a new clustering procedure called Spectral Clustering with Feature Selection (SC-FS), where we first obtain an initial estimate of labels via spectral clustering, then select a small fraction of features with the largest R-squared with these labels, i.e., the proportion of variation explained by group labels, and conduct clustering again using selected features. Under mild conditions, we prove that the proposed method identifies all informative features with high probability and achieves minimax optimal clustering error rate for the sparse Gaussian mixture model. Applications of SC-FS to four real-world data sets demonstrate its usefulness in clustering high-dimensional data.
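The three steps in the abstract (initial spectral clustering, R-squared feature ranking, re-clustering on the selected features) can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation. It rests on several assumptions: "spectral clustering" is taken to mean k-means on the top-k singular directions of the centered data, the k-means routine is a plain Lloyd's algorithm with farthest-point initialization, and the function names (`kmeans`, `spectral_labels`, `r_squared`, `sc_fs`), the `n_keep` parameter, and the synthetic data are all hypothetical.

```python
import numpy as np

def kmeans(X, k, n_iter=50, seed=0):
    """Plain Lloyd's algorithm with farthest-point initialization (illustrative)."""
    rng = np.random.default_rng(seed)
    centers = [X[rng.integers(len(X))]]
    for _ in range(1, k):
        # Distance from each point to its nearest already-chosen center.
        d = np.min([((X - c) ** 2).sum(axis=1) for c in centers], axis=0)
        centers.append(X[np.argmax(d)])
    centers = np.array(centers)
    for _ in range(n_iter):
        dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        # Keep the old center if a cluster becomes empty.
        new_centers = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels

def spectral_labels(X, k, seed=0):
    """Assumed form of spectral clustering: k-means on the top-k SVD scores."""
    Xc = X - X.mean(axis=0)
    U, s, _ = np.linalg.svd(Xc, full_matrices=False)
    return kmeans(U[:, :k] * s[:k], k, seed=seed)

def r_squared(X, labels, k):
    """Per-feature proportion of variation explained by the group labels."""
    total = ((X - X.mean(axis=0)) ** 2).sum(axis=0)
    within = np.zeros(X.shape[1])
    for j in range(k):
        G = X[labels == j]
        if len(G):
            within += ((G - G.mean(axis=0)) ** 2).sum(axis=0)
    return 1.0 - within / np.maximum(total, 1e-12)

def sc_fs(X, k, n_keep, seed=0):
    """Initial clustering, keep the n_keep features with largest R^2, re-cluster."""
    init = spectral_labels(X, k, seed)
    r2 = r_squared(X, init, k)
    selected = np.sort(np.argsort(r2)[-n_keep:])
    final = spectral_labels(X[:, selected], k, seed)
    return final, selected

# Synthetic two-cluster example: only the first 10 of 300 features carry signal.
rng = np.random.default_rng(1)
n, p, n_informative = 120, 300, 10
truth = np.repeat([0, 1], n // 2)
X = rng.normal(size=(n, p))
X[:, :n_informative] += np.where(truth[:, None] == 0, -1.5, 1.5)

labels, selected = sc_fs(X, k=2, n_keep=n_informative)
# Cluster labels are only defined up to permutation, so score both matchings.
accuracy = max(np.mean(labels == truth), np.mean(labels != truth))
print("selected features:", selected)
print("clustering accuracy:", accuracy)
```

In this sketch the number of retained features is supplied by hand through `n_keep`. The abstract only says that "a small fraction of features" is kept, so how that fraction is chosen in practice is left open here.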
Taxonomy
Topics: Bayesian Methods and Mixture Models · Gene expression and cancer classification · Face and Expression Recognition
