Dimensionality's Blessing: Clustering Images by Underlying Distribution
Wen-Yan Lin, Siying Liu, Jian-Huang Lai, Yasuyuki Matsushita

TL;DR
This paper reinterprets the contrast-loss phenomenon in high-dimensional data as a blessing, leading to a novel distribution-based clustering algorithm that effectively groups data by underlying distributions, even with unlabeled and noisy data.
Contribution
The paper introduces distribution-clustering, a new algorithm that leverages the hyper-shell concentration phenomenon to automatically identify and cluster data based on their underlying distributions.
Findings
Creates clean, well-separated clusters from raw data
Automatically estimates the number of clusters
Robust to outliers and noise
Abstract
Many high-dimensional vector distances tend to a constant. This is typically considered a negative "contrast-loss" phenomenon that hinders clustering and other machine learning techniques. We reinterpret "contrast-loss" as a blessing. Re-deriving "contrast-loss" using the law of large numbers, we show it results in a distribution's instances concentrating on a thin "hyper-shell". The hollow center means apparently chaotically overlapping distributions are actually intrinsically separable. We use this to develop distribution-clustering, an elegant algorithm for grouping data points by their (unknown) underlying distribution. Distribution-clustering creates notably clean clusters from raw unlabeled data, estimates the number of clusters for itself, and is inherently robust to "outliers", which form their own clusters. This enables trawling for patterns in unorganized data and may be the…
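The hyper-shell claim in the abstract can be illustrated with a small simulation. The sketch below is not the paper's distribution-clustering algorithm. It only assumes two zero-mean isotropic Gaussians, A and B, that share a center and differ in standard deviation, and it measures how far their samples lie from that center. The function name `shell_stats` and all parameter values are hypothetical choices made for this illustration.

```python
import math
import random
import statistics


def shell_stats(dim, sigma=1.0, n=200, seed=0):
    """Return (mean, std) of the distances from the origin for n samples
    drawn from a zero-mean isotropic Gaussian with standard deviation sigma.

    Illustrative helper only; it is not taken from the paper.
    """
    rng = random.Random(seed)
    norms = [
        math.sqrt(sum(rng.gauss(0.0, sigma) ** 2 for _ in range(dim)))
        for _ in range(n)
    ]
    return statistics.mean(norms), statistics.stdev(norms)


# Compare a low-dimensional and a high-dimensional setting.
for dim in (2, 1000):
    mean_a, std_a = shell_stats(dim, sigma=1.0, seed=1)  # distribution A
    mean_b, std_b = shell_stats(dim, sigma=2.0, seed=2)  # distribution B
    print(
        f"dim={dim}: A radius {mean_a:.2f} +/- {std_a:.2f}, "
        f"B radius {mean_b:.2f} +/- {std_b:.2f}"
    )
```

In two dimensions the two sets of distances overlap heavily. In 1000 dimensions each distribution's samples sit at a nearly constant distance from the center (roughly sigma times the square root of the dimension), so the two shells no longer overlap even though the distributions share the same mean. This is the sense in which apparently overlapping distributions become separable.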
Taxonomy
Topics: Advanced Clustering Algorithms Research · Image Retrieval and Classification Techniques · Bayesian Methods and Mixture Models
