Revisiting Agglomerative Clustering
Eric K. Tokuda, Cesar H. Comin, Luciano da F. Costa

TL;DR
This paper evaluates six agglomerative clustering methods (single, average, median, complete, centroid and Ward's) on unimodal and bimodal data drawn from uniform, Gaussian, exponential and power-law distributions. It proposes an objective way to identify clusters from dendrograms and analyzes how well each method avoids false positives.
Contribution
It introduces a model for cluster relevance based on dendrogram heights and assesses the robustness of different agglomerative methods across diverse datasets.
Findings
Single-linkage is more resistant to false positives than the other methods evaluated.
Many methods detect spurious clusters in unimodal data.
Cluster relevance can be quantified by dendrogram subtree heights.
Abstract
An important issue in clustering concerns the avoidance of false positives while searching for clusters. This work addresses this problem considering agglomerative methods, namely single, average, median, complete, centroid and Ward's approaches, applied to unimodal and bimodal datasets obeying uniform, Gaussian, exponential and power-law distributions. A model of clusters is also adopted, involving a higher-density nucleus surrounded by a transition, followed by outliers. This paves the way to defining an objective means for identifying the clusters from dendrograms. The adopted model also allows the relevance of the clusters to be quantified in terms of the height of their subtrees. The obtained results include the verification that many methods detect two clusters in unimodal data. The single-linkage method was found to be more resilient to false positives. Also, several methods…
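To make the experimental setup in the abstract more concrete, here is a minimal sketch of running the six linkage methods on unimodal samples from the four distributions. It is not the authors' code. The two-dimensional data, the sample size, the distribution parameters, and the helper names (`sample_unimodal`, `top_gap_ratio`, `run`) are all assumptions made for illustration. In particular, `top_gap_ratio` is a simple stand-in indicator, not the paper's relevance measure: it reports how far the final merge sits above the previous one, so a large value suggests the dendrogram splits into two well-separated subtrees.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage

# The six agglomerative methods named in the abstract (SciPy method names).
METHODS = ["single", "average", "median", "complete", "centroid", "ward"]
# The four distribution families named in the abstract.
DISTRIBUTIONS = ["uniform", "gaussian", "exponential", "power-law"]


def sample_unimodal(name, n, rng):
    """Draw n two-dimensional points from a single-mode distribution.

    Dimension and parameters are illustrative assumptions.
    """
    if name == "uniform":
        return rng.uniform(-1.0, 1.0, size=(n, 2))
    if name == "gaussian":
        return rng.normal(0.0, 1.0, size=(n, 2))
    if name == "exponential":
        return rng.exponential(1.0, size=(n, 2))
    if name == "power-law":
        # Pareto II (Lomax) samples with shape 3.0 as a power-law stand-in.
        return rng.pareto(3.0, size=(n, 2))
    raise ValueError(f"unknown distribution: {name}")


def top_gap_ratio(points, method):
    """Hypothetical indicator: relative gap between the last two merges.

    Column 2 of the linkage matrix holds the merge heights. Centroid and
    median linkage can produce inversions, so the ratio may be negative.
    """
    Z = linkage(points, method=method)  # Euclidean metric by default
    last, prev = Z[-1, 2], Z[-2, 2]
    return (last - prev) / last


def run(n=200, seed=0):
    """Compute the indicator for every (distribution, method) pair."""
    rng = np.random.default_rng(seed)
    results = {}
    for dist in DISTRIBUTIONS:
        points = sample_unimodal(dist, n, rng)
        for method in METHODS:
            results[(dist, method)] = top_gap_ratio(points, method)
    return results


if __name__ == "__main__":
    for (dist, method), ratio in run().items():
        print(f"{dist:12s} {method:9s} {ratio:.3f}")
```

Because every dataset here is unimodal, any method that yields a large gap ratio is hinting at a two-cluster structure that is not present in the generating distribution, which is the kind of false positive the abstract discusses. A faithful reproduction would replace the indicator with the paper's subtree-height relevance measure.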
