A penalized criterion for selecting the number of clusters for K-medians

Antoine Godichon-Baggioni (LPSM, UMR 8001); Sobihan Surendran (LPSM, UMR 8001)

arXiv:2209.03597·math.ST·February 28, 2024·J. Comput. Graph. Stat.

Open Access

TL;DR

This paper introduces a penalized criterion for selecting the number of clusters in K-medians clustering, aimed at contaminated data where K-medians is more robust than K-means. The criterion is compared with other popular techniques in a simulation study, and the algorithms are available in the R package Kmedians.

Contribution

It proposes a new penalized risk criterion for choosing the number of clusters in K-medians, supported by an oracle-type inequality and by a simulation-based comparison with other popular techniques.
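In generic form, a criterion of this kind can be written as below. This is only a sketch: the notation (observations $X_1,\dots,X_n$, fitted centers $\hat c_{k,1},\dots,\hat c_{k,k}$, maximum number of clusters $k_{\max}$) and the use of the Euclidean norm are assumptions introduced here for illustration, and the specific penalty shape $\mathrm{pen}(k)$ derived in the paper is not reproduced.

```latex
\[
\hat{k} \in \operatorname*{arg\,min}_{1 \le k \le k_{\max}}
\left\{
  \frac{1}{n} \sum_{i=1}^{n} \min_{1 \le j \le k} \bigl\lVert X_i - \hat c_{k,j} \bigr\rVert
  \;+\; \mathrm{pen}(k)
\right\}
\]
```

The first term is the empirical K-medians risk, which typically decreases as $k$ grows; the penalty increases with $k$ and offsets that decrease.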

Findings

01. A suitable penalty shape is obtained for the K-medians criterion, together with an associated oracle-type inequality.

02. The method performs well in simulations with contaminated data.

03. All studied algorithms are available in the R package Kmedians on CRAN.

Abstract

Clustering is a common unsupervised machine learning technique for grouping data points based on similar features. We focus here on unsupervised clustering for contaminated data, i.e., the case where K-medians should be preferred to K-means because of its robustness. More precisely, we concentrate on a common question in clustering: how to choose the number of clusters? The answer proposed here is to treat the choice of the optimal number of clusters as the minimization of a risk function via penalization. In this paper, we obtain a suitable penalty shape for our criterion and derive an associated oracle-type inequality. Finally, the performance of this approach with different types of K-medians algorithms is compared with other popular techniques in a simulation study. All studied algorithms are available in the R package Kmedians on CRAN.
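The selection procedure described in the abstract can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the implementation in the Kmedians package: the function names are hypothetical, centers are updated with Weiszfeld iterations for the geometric median, the risk is taken to be the mean Euclidean distance to the nearest center, and the penalty `lam * sqrt(k / n)` is a placeholder rather than the penalty shape derived in the paper.

```python
import numpy as np


def geometric_median(points, n_iter=50, eps=1e-9):
    """Approximate the geometric median of the rows of `points` (Weiszfeld iterations)."""
    median = points.mean(axis=0)
    for _ in range(n_iter):
        dist = np.linalg.norm(points - median, axis=1)
        weights = 1.0 / np.maximum(dist, eps)  # guard against division by zero
        median = (weights[:, None] * points).sum(axis=0) / weights.sum()
    return median


def k_medians(X, k, n_iter=20, seed=0):
    """Lloyd-style K-medians; returns the centers and the empirical risk."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)].copy()
    for _ in range(n_iter):
        dist = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dist.argmin(axis=1)
        for j in range(k):
            members = X[labels == j]
            if len(members) > 0:  # keep the old center if a cluster empties
                centers[j] = geometric_median(members)
    dist = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
    risk = dist.min(axis=1).mean()  # mean distance to the nearest center
    return centers, risk


def select_k(X, k_max, lam=1.0):
    """Pick k minimizing risk(k) + penalty(k); the penalty here is a placeholder."""
    n = len(X)
    scores = {}
    for k in range(1, k_max + 1):
        _, risk = k_medians(X, k)
        scores[k] = risk + lam * np.sqrt(k / n)  # hypothetical penalty, not the paper's
    best_k = min(scores, key=scores.get)
    return best_k, scores


# Toy data: three well-separated Gaussian clusters in the plane.
rng = np.random.default_rng(42)
means = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]])
X = np.vstack([m + 0.5 * rng.standard_normal((50, 2)) for m in means])
best_k, scores = select_k(X, k_max=6)
```

The constant `lam` controls how strongly additional clusters are penalized and would need to be calibrated in practice; the value used here is arbitrary.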

Taxonomy

Topics: Bayesian Methods and Mixture Models · Advanced Clustering Algorithms Research · Statistical Methods and Inference