Clustering Based on Pairwise Distances When the Data is of Mixed Dimensions
Ery Arias-Castro

TL;DR
This paper provides theoretical guarantees for clustering methods based on pairwise distances when the clusters differ in shape, dimension, size and density, showing near-optimal separation between clusters and robustness to outliers.
Contribution
It gives an asymptotic theoretical analysis of several emblematic pairwise-distance clustering methods under a generative model in which clusters have different shapes, dimensions, sizes and densities.
Findings
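Clustering by connected components of a neighborhood graph comes with theoretical guarantees as the number of points becomes large.
The spectral clustering method of Ng, Jordan and Weiss enjoys near-optimal properties in terms of separation between clusters and robustness to outliers.
The local scaling method of Zelnik-Manor and Perona leads to a near-optimal choice of scale for both of these methods.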
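As an illustration of local scaling, here is a minimal sketch of the locally scaled affinity in the form usually attributed to Zelnik-Manor and Perona, where each point i gets its own scale sigma_i (the distance to its k-th nearest neighbor) and the affinity between i and j is exp(-d(i,j)^2 / (sigma_i * sigma_j)). The function names, the default `k=2`, and the zero diagonal are assumptions made for this example; the summary above does not say how the paper itself adapts the method, and the sketch assumes all points are distinct so that no scale is zero.

```python
import math

def local_scales(points, k):
    """Return, for each point, the distance to its k-th nearest neighbor."""
    scales = []
    for i, p in enumerate(points):
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        scales.append(dists[k - 1])
    return scales

def local_scaling_affinity(points, k=2):
    """Build the affinity matrix A[i][j] = exp(-d(i,j)^2 / (sigma_i * sigma_j)).

    Assumes distinct points (so every sigma_i > 0) and leaves the diagonal at 0.
    """
    sigma = local_scales(points, k)
    n = len(points)
    A = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            d2 = math.dist(points[i], points[j]) ** 2
            A[i][j] = A[j][i] = math.exp(-d2 / (sigma[i] * sigma[j]))
    return A

# A tight cluster and a copy of it stretched by a factor of 20.
tight = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1)]
loose = [(10.0, 10.0), (12.0, 10.0), (10.0, 12.0)]
A = local_scaling_affinity(tight + loose, k=2)
```

Because each squared distance is divided by the product of two local scales, the within-cluster affinities of the tight and the stretched cluster come out (numerically) equal, while affinities across the two clusters are essentially zero. A single global scale could not treat both densities this way.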
Abstract
In the context of clustering, we consider a generative model in a Euclidean ambient space with clusters of different shapes, dimensions, sizes and densities. In an asymptotic setting where the number of points becomes large, we obtain theoretical guarantees for a few emblematic methods based on pairwise distances: a simple algorithm based on the extraction of connected components in a neighborhood graph; the spectral clustering method of Ng, Jordan and Weiss; and hierarchical clustering with single linkage. The methods are shown to enjoy some near-optimal properties in terms of separation between clusters and robustness to outliers. The local scaling method of Zelnik-Manor and Perona is shown to lead to a near-optimal choice for the scale in the first two methods. We also provide a lower bound on the spectral gap to consistently choose the correct number of clusters in the spectral…
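The first method named in the abstract, extracting connected components of a neighborhood graph, can be sketched in a few lines. This is only an illustrative version under stated assumptions: the graph joins two points when their Euclidean distance is at most a fixed radius `eps`, components are found with a union-find structure, and the function name and parameters are hypothetical rather than taken from the paper.

```python
import math
from itertools import combinations

def neighborhood_graph_clusters(points, eps):
    """Label each point by its connected component in the eps-neighborhood graph.

    Two points are joined by an edge when their Euclidean distance is <= eps.
    Labels are integers assigned in order of first appearance.
    """
    n = len(points)
    parent = list(range(n))

    def find(i):
        # Follow parent links to the root, halving the path along the way.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Merge the components of every pair of points within distance eps.
    for i, j in combinations(range(n), 2):
        if math.dist(points[i], points[j]) <= eps:
            ri, rj = find(i), find(j)
            if ri != rj:
                parent[rj] = ri

    # Relabel roots as 0, 1, 2, ... in order of first appearance.
    roots = {}
    return [roots.setdefault(find(i), len(roots)) for i in range(n)]

pts = [(0.0, 0.0), (0.1, 0.0), (0.2, 0.1), (5.0, 5.0), (5.1, 5.0), (5.0, 5.2)]
labels = neighborhood_graph_clusters(pts, eps=0.5)
print(labels)  # [0, 0, 0, 1, 1, 1]
```

Cutting a single-linkage dendrogram at height `eps` yields the same partition as these components, which is one way to see why the two methods are often analyzed together. The quadratic loop over pairs keeps the sketch short; a spatial index would be the usual choice for large samples.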
Taxonomy
Topics: Face and Expression Recognition · Advanced Clustering Algorithms Research · Bayesian Methods and Mixture Models
