An empirical comparison and characterisation of nine popular clustering methods
Christian Hennig

TL;DR
This study empirically compares nine popular clustering methods on 42 real datasets. It characterizes their clusterings with several cluster validation indexes and, for the 30 datasets that come with a "true" clustering, examines how closely each method recovers it.
Contribution
It provides a detailed characterization of clustering methods and relates cluster properties to their similarity with true clusterings, aiding method selection.
Findings
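The nine methods differ in how closely their clusterings match the "true" clusterings on the 30 datasets that provide one.
A mixed effects regression relates observable cluster properties, such as small within-cluster distances and separation, to similarity with the "true" clusterings.
The validation indexes show what kind of clusters each method can be expected to produce.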
Abstract
Nine popular clustering methods are applied to 42 real data sets. The aim is to give a detailed characterisation of the methods by means of several cluster validation indexes that measure various individual aspects of the resulting clusters such as small within-cluster distances, separation of clusters, closeness to a Gaussian distribution etc. as introduced in Hennig (2019). 30 of the data sets come with a "true" clustering. On these data sets the similarity of the clusterings from the nine methods to the "true" clusterings is explored. Furthermore, a mixed effects regression relates the observable individual aspects of the clusters to the similarity with the "true" clusterings, which in real clustering problems is unobservable. The study gives new insight not only into the ability of the methods to discover "true" clusterings, but also into properties of clusterings that can be…
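The abstract speaks of the "similarity" between a method's clustering and the "true" clustering without naming the measure in the part shown here. As a minimal sketch, assuming the adjusted Rand index (a common choice for this kind of comparison, not confirmed by the text above), the similarity could be computed as follows. The function name `adjusted_rand_index` and the toy label lists are illustrative only.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_a, labels_b):
    """Adjusted Rand index between two clusterings of the same objects.

    Assumption: this is one common similarity measure; the summary above
    does not state which measure the study uses.
    """
    if len(labels_a) != len(labels_b):
        raise ValueError("both clusterings must label the same objects")
    n = len(labels_a)
    if n < 2:
        return 1.0  # no pairs to disagree on

    # Contingency counts: objects sharing a cluster in A, in B, and in both.
    cell_counts = Counter(zip(labels_a, labels_b))
    a_counts = Counter(labels_a)
    b_counts = Counter(labels_b)

    # Numbers of object pairs placed together.
    together_both = sum(comb(c, 2) for c in cell_counts.values())
    together_a = sum(comb(c, 2) for c in a_counts.values())
    together_b = sum(comb(c, 2) for c in b_counts.values())

    expected = together_a * together_b / comb(n, 2)
    maximum = (together_a + together_b) / 2
    if maximum == expected:
        return 1.0  # degenerate case, e.g. both clusterings are trivial
    return (together_both - expected) / (maximum - expected)


# Toy example: cluster labels are arbitrary, so a pure relabelling scores 1.
truth = [0, 0, 0, 1, 1, 1]
found = [1, 1, 1, 0, 0, 0]
print(adjusted_rand_index(truth, found))
```

A value of 1 means the two clusterings are identical up to relabelling, values near 0 are what random labelling would give on average, and negative values indicate less agreement than expected by chance (for instance, `adjusted_rand_index([0, 0, 1, 1], [0, 1, 0, 1])` is close to -0.5).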
Taxonomy
Topics: Advanced Clustering Algorithms Research · Bayesian Methods and Mixture Models · Face and Expression Recognition
