Wasserstein $K$-means for clustering probability distributions

Yubo Zhuang; Xiaohui Chen; Yun Yang

arXiv:2209.06975·stat.ML·October 14, 2022·5 cites


TL;DR

This paper introduces a Wasserstein $K$-means clustering method for probability distributions, demonstrating its advantages over centroid-based approaches, especially in the context of Gaussian distributions and real-world data.

Contribution

The paper proposes a distance-based Wasserstein $K$-means with SDP relaxation, showing it can recover true clusters and outperform centroid-based methods in clustering distributions.
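As a hedged illustration of what "distance-based" means here (the notation $G_k$, $\bar{x}_k$, $\mu_i$ is assumed for this sketch and not quoted from the paper): for Euclidean points, the within-cluster sum of squares around the centroid can be rewritten using pairwise distances only, which is why the two formulations coincide in that setting. A distance-based objective for distributions then swaps the squared Euclidean distance for the squared 2-Wasserstein distance, so no barycenter ever needs to be computed.

```latex
% Euclidean identity: centroid-based cost equals pairwise (distance-based) cost
\sum_{i \in G_k} \lVert x_i - \bar{x}_k \rVert^2
  = \frac{1}{2\,|G_k|} \sum_{i,j \in G_k} \lVert x_i - x_j \rVert^2,
\qquad
\bar{x}_k = \frac{1}{|G_k|} \sum_{i \in G_k} x_i .

% Distance-based analogue for distributions \mu_1, \dots, \mu_n (sketch):
\min_{G_1, \dots, G_K} \; \sum_{k=1}^{K} \frac{1}{2\,|G_k|}
  \sum_{i,j \in G_k} W_2^2(\mu_i, \mu_j) .
```

The first line is a standard identity; the second is one natural way to write the pairwise objective and may differ from the paper's exact normalization by a constant factor.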

Findings

01

SDP relaxation achieves exact recovery for well-separated Gaussian clusters.

02

Distance-based Wasserstein $K$-means outperforms centroid-based methods in experiments.

03

Wasserstein barycenters face regularity issues affecting centroid-based clustering.

Abstract

Clustering is an important exploratory data analysis technique to group objects based on their similarity. The widely used $K$-means clustering method relies on some notion of distance to partition data into a smaller number of groups. In Euclidean space, the centroid-based and distance-based formulations of $K$-means are equivalent. In modern machine learning applications, data often arise as probability distributions, and a natural generalization to handle measure-valued data is to use the optimal transport metric. Due to the non-negative Alexandrov curvature of the Wasserstein space, barycenters suffer from regularity and non-robustness issues. The peculiar behaviors of Wasserstein barycenters may make the centroid-based formulation fail to represent the within-cluster data points, while the more direct distance-based $K$-means approach and its semidefinite program (SDP) relaxation are…
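The distance-based idea described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: it uses one-dimensional Gaussians, for which the squared 2-Wasserstein distance has the closed form $(m_1 - m_2)^2 + (s_1 - s_2)^2$, and it only evaluates the pairwise cost of a given partition (it does not solve the SDP relaxation). The function names `pairwise_w2_sq` and `distance_based_cost` and the toy data are hypothetical.

```python
import numpy as np


def pairwise_w2_sq(means, stds):
    """Squared 2-Wasserstein distances between 1D Gaussians N(m_i, s_i^2).

    Uses the closed form W2^2 = (m_i - m_j)^2 + (s_i - s_j)^2,
    which holds for one-dimensional Gaussians.
    """
    means = np.asarray(means, dtype=float)
    stds = np.asarray(stds, dtype=float)
    dm = means[:, None] - means[None, :]
    ds = stds[:, None] - stds[None, :]
    return dm ** 2 + ds ** 2


def distance_based_cost(D, labels):
    """Pairwise K-means cost: sum over clusters of (1 / (2|G|)) * sum_{i,j in G} D[i, j].

    Only the distance matrix D is needed; no barycenter is computed.
    """
    labels = np.asarray(labels)
    cost = 0.0
    for k in np.unique(labels):
        idx = np.flatnonzero(labels == k)
        cost += D[np.ix_(idx, idx)].sum() / (2 * len(idx))
    return cost


# Toy data (assumed for illustration): two well-separated groups of Gaussians.
means = [0.0, 0.1, 5.0, 5.2]
stds = [1.0, 1.1, 2.0, 2.1]
D = pairwise_w2_sq(means, stds)

good = distance_based_cost(D, [0, 0, 1, 1])  # groups similar distributions
bad = distance_based_cost(D, [0, 1, 0, 1])   # mixes the two groups
print(f"cost(good) = {good:.3f}, cost(bad) = {bad:.3f}")
```

The partition that keeps similar distributions together has the smaller cost. Note that one-dimensional Gaussians are chosen only because their distances are available in closed form; this toy case does not exhibit the barycenter irregularities the abstract refers to.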


Code & Models

Repositories

yubo02/wasserstein-k-means-for-clustering-probability-distributions
PyTorch · Official

Videos

Wasserstein $K$-means for clustering probability distributions · SlidesLive

Taxonomy

Topics: Anomaly Detection Techniques and Applications · Topological and Geometric Data Analysis