Wasserstein $K$-means for clustering probability distributions
Yubo Zhuang, Xiaohui Chen, Yun Yang

TL;DR
This paper introduces a Wasserstein $K$-means clustering method for probability distributions, demonstrating its advantages over centroid-based approaches, especially in the context of Gaussian distributions and real-world data.
Contribution
The paper proposes a distance-based Wasserstein $K$-means with SDP relaxation, showing it can recover true clusters and outperform centroid-based methods in clustering distributions.
Findings
SDP relaxation achieves exact recovery for well-separated Gaussian clusters.
Distance-based Wasserstein $K$-means outperforms centroid-based methods in experiments.
Wasserstein barycenters face regularity issues affecting centroid-based clustering.
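A minimal sketch of the two ingredients behind these findings, under stated assumptions: the 2-Wasserstein distance between Gaussians has a well-known closed form, and a distance-based clustering cost needs only pairwise distances, never a barycenter. The function names below are illustrative and do not come from the paper, and the 1/(2|G|) normalization is an assumption chosen to match the Euclidean $K$-means cost; the paper's exact scaling may differ.

```python
import numpy as np

def psd_sqrt(S):
    # Square root of a symmetric positive semidefinite matrix via eigendecomposition.
    w, V = np.linalg.eigh(S)
    w = np.clip(w, 0.0, None)  # guard against tiny negative eigenvalues from round-off
    return (V * np.sqrt(w)) @ V.T

def w2_squared_gaussian(m1, S1, m2, S2):
    # Closed form: ||m1 - m2||^2 + tr(S1 + S2 - 2 (S1^{1/2} S2 S1^{1/2})^{1/2}).
    r = psd_sqrt(S1)
    cross = psd_sqrt(r @ S2 @ r)
    return float(np.sum((m1 - m2) ** 2) + np.trace(S1 + S2 - 2.0 * cross))

def distance_based_cost(D2, labels):
    # D2[i, j] holds the squared distance between items i and j.
    # Cost = sum over clusters G of (1 / (2|G|)) * sum_{i, j in G} D2[i, j].
    # The normalization is an assumption, not taken from the paper.
    total = 0.0
    for k in np.unique(labels):
        idx = np.flatnonzero(labels == k)
        total += D2[np.ix_(idx, idx)].sum() / (2 * len(idx))
    return float(total)
```

Filling `D2` with `w2_squared_gaussian` values for every pair of input Gaussians gives a cost that can be compared across candidate partitions without ever computing a Wasserstein barycenter.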
Abstract
Clustering is an important exploratory data analysis technique to group objects based on their similarity. The widely used $K$-means clustering method relies on some notion of distance to partition data into a fewer number of groups. In the Euclidean space, centroid-based and distance-based formulations of the $K$-means are equivalent. In modern machine learning applications, data often arise as probability distributions and a natural generalization to handle measure-valued data is to use the optimal transport metric. Due to non-negative Alexandrov curvature of the Wasserstein space, barycenters suffer from regularity and non-robustness issues. The peculiar behaviors of Wasserstein barycenters may make the centroid-based formulation fail to represent the within-cluster data points, while the more direct distance-based $K$-means approach and its semidefinite program (SDP) relaxation are…
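The Euclidean equivalence mentioned in the abstract rests on the identity $\sum_{i \in G} \|x_i - \bar{x}\|^2 = \frac{1}{2|G|} \sum_{i,j \in G} \|x_i - x_j\|^2$ for any cluster $G$ with mean $\bar{x}$. The following is a small numerical check of that identity on synthetic data; the data, seed, and function names are arbitrary choices made for illustration.

```python
import numpy as np

def centroid_cost(X):
    # Centroid-based cost: squared distances from each point to the cluster mean.
    c = X.mean(axis=0)
    return float(np.sum((X - c) ** 2))

def pairwise_cost(X):
    # Distance-based cost: all pairwise squared distances, scaled by 1 / (2n).
    diffs = X[:, None, :] - X[None, :, :]
    return float(np.sum(diffs ** 2) / (2 * len(X)))

rng = np.random.default_rng(0)      # arbitrary seed
X = rng.normal(size=(50, 3))        # one synthetic cluster of 50 points in R^3
gap = abs(centroid_cost(X) - pairwise_cost(X))
print(f"absolute gap between the two costs: {gap:.2e}")
```

In the Euclidean case the gap vanishes up to floating-point error. The abstract's point is that this equivalence is not available in Wasserstein space, where the barycenter plays the role of the mean.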
Taxonomy
Topics: Anomaly Detection Techniques and Applications · Topological and Geometric Data Analysis
