Revisiting Agglomerative Clustering

Eric K. Tokuda; Cesar H. Comin; Luciano da F. Costa

arXiv:2005.07995 · cs.LG · June 30, 2020

TL;DR

This paper evaluates various agglomerative clustering methods on different data distributions, proposing an objective way to identify true clusters from dendrograms and analyzing their effectiveness in avoiding false positives.

Contribution

It introduces a model for cluster relevance based on dendrogram heights and assesses the robustness of different agglomerative methods across diverse datasets.

Findings

01. Single-linkage is more resistant to false positives.

02. Many methods detect spurious clusters in unimodal data.

03. Cluster relevance can be quantified by dendrogram subtree heights.

Abstract

An important issue in clustering concerns the avoidance of false positives while searching for clusters. This work addressed this problem by considering agglomerative methods, namely the single, average, median, complete, centroid and Ward's approaches, applied to unimodal and bimodal datasets obeying uniform, Gaussian, exponential and power-law distributions. A model of clusters was also adopted, involving a higher-density nucleus surrounded by a transition region, followed by outliers. This paved the way to defining an objective means for identifying the clusters from dendrograms. The adopted model also allowed the relevance of the clusters to be quantified in terms of the height of their subtrees. The obtained results include the verification that many methods detect two clusters in unimodal data. The single-linkage method was found to be more resilient to false positives. Also, several methods…
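To make the idea of reading clusters off dendrogram heights more concrete, here is a minimal pure-Python sketch. It is not the authors' implementation. It assumes 1-D data, covers only three of the six linkage rules named in the abstract (single, complete, average), and uses a naive algorithm chosen for readability rather than speed. The function names `agglomerate` and `top_split` are hypothetical. The sketch only reports the sizes and merge heights of the two subtrees joined at the root; it does not reproduce the paper's cluster model or its relevance measure.

```python
import random


def agglomerate(points, method="single"):
    """Naive agglomerative clustering of 1-D points (illustrative only).

    Returns a list of merges (left_id, right_id, height, size). Leaves have
    ids 0..n-1 and the cluster created by merge k has id n + k.
    """
    # Each linkage rule reduces the list of pairwise distances between
    # two clusters to a single inter-cluster distance.
    linkage = {
        "single": min,
        "complete": max,
        "average": lambda ds: sum(ds) / len(ds),
    }[method]
    n = len(points)
    clusters = {i: [p] for i, p in enumerate(points)}  # id -> member values
    merges = []
    while len(clusters) > 1:
        ids = sorted(clusters)
        best = None
        for pos, a in enumerate(ids):
            for b in ids[pos + 1:]:
                dists = [abs(x - y) for x in clusters[a] for y in clusters[b]]
                d = linkage(dists)
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        new_id = n + len(merges)
        clusters[new_id] = clusters.pop(a) + clusters.pop(b)
        merges.append((a, b, d, len(clusters[new_id])))
    return merges


def top_split(merges, n):
    """Return the root height and a description of its two child subtrees."""
    a, b, root_height, _ = merges[-1]

    def describe(cid):
        if cid < n:  # a leaf is a single point merged at height 0
            return {"size": 1, "height": 0.0}
        _, _, height, size = merges[cid - n]
        return {"size": size, "height": height}

    return root_height, describe(a), describe(b)


# Unimodal sample (assumed parameters): one Gaussian, so any two-cluster
# reading of the dendrogram would be a false positive.
rng = random.Random(0)
unimodal = [rng.gauss(0.0, 1.0) for _ in range(60)]
for method in ("single", "complete", "average"):
    root_height, left, right = top_split(agglomerate(unimodal, method), len(unimodal))
    print(method, round(root_height, 3), left["size"], right["size"])
```

Comparing the two child sizes across methods gives a rough feel for the behaviour the abstract describes: a method that splits unimodal data into two sizeable subtrees at the root is suggesting a second cluster that is not there, whereas a root merge that only attaches a few outlying points is not.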
