Statistical Properties of the Single Linkage Hierarchical Clustering Estimator
Dekang Zhu, Dan P. Guralnik, Xuezhi Wang, Xiang Li, Bill Moran

TL;DR
This paper examines the statistical properties of the single linkage hierarchical clustering (SLHC) estimator under measurement noise in the pairwise distances, and argues that MLE-based methods are expected to outperform SLHC in recovering the true clustering structure.
Contribution
It introduces a statistical framework for analyzing SLHC under noisy measurements and compares its performance to maximum likelihood estimation (MLE), highlighting the advantages of MLE.
Findings
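Under fairly reasonable conditions on the probability distribution governing the measurements, SLHC is equivalent to maximum partial profile likelihood estimation (MPPLE) with some of the information in the data ignored.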
Applying SLHC directly to the maximum likelihood estimate (MLE) of the pairwise distances yields a consistent estimator.
MLE-based methods are expected to outperform SLHC in accurately recovering the true metric.
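The idea of evaluating SLHC on an MLE of the pairwise distances can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes i.i.d. Gaussian noise on each pairwise distance (so the MLE of each distance is the sample mean), and it represents the SLHC output as the subdominant ultrametric, computed as a minimax path cost. The function names, the 4-point distance matrix, and the noise scale are all hypothetical choices made for illustration.

```python
import numpy as np


def single_linkage_ultrametric(d):
    """Return the single linkage (subdominant) ultrametric of a distance matrix.

    u[i, j] is the merge height of i and j under single linkage, i.e. the
    smallest possible value of the largest edge on a path from i to j.
    Assumes d is symmetric with a zero diagonal.
    """
    u = np.array(d, dtype=float)
    for k in range(u.shape[0]):
        # Floyd-Warshall style update where a path costs its largest edge.
        u = np.minimum(u, np.maximum(u[:, k][:, None], u[k, :][None, :]))
    return u


def slhc_on_mean(samples):
    """Apply SLHC to the entrywise mean of repeated distance measurements.

    Under the Gaussian noise assumption made here, the mean is the MLE
    of the pairwise distances.
    """
    return single_linkage_ultrametric(np.mean(samples, axis=0))


# Hypothetical ground-truth distances between 4 points.
true_d = np.array([
    [0.0, 1.0, 4.0, 5.0],
    [1.0, 0.0, 3.0, 6.0],
    [4.0, 3.0, 0.0, 2.0],
    [5.0, 6.0, 2.0, 0.0],
])
true_u = single_linkage_ultrametric(true_d)

# Simulate m noisy, symmetric measurements of the distance matrix.
rng = np.random.default_rng(0)
m, n = 200, true_d.shape[0]
noise = rng.normal(scale=0.3, size=(m, n, n))
noise = (noise + noise.transpose(0, 2, 1)) / 2   # keep each sample symmetric
idx = np.arange(n)
noise[:, idx, idx] = 0.0                          # keep the diagonal at zero
samples = true_d + noise

# Compare SLHC on one noisy sample with SLHC on the averaged distances.
err_single = np.max(np.abs(single_linkage_ultrametric(samples[0]) - true_u))
err_mean = np.max(np.abs(slhc_on_mean(samples) - true_u))
print("max error, one sample:", err_single)
print("max error, mean of", m, "samples:", err_mean)
```

As the number of measurements grows, the averaged distances concentrate around the true ones, so the ultrametric computed from them should approach the true ultrametric. This is the intuition behind consistency, not a proof of it.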
Abstract
Distance-based hierarchical clustering (HC) methods are widely used in unsupervised data analysis but few authors take account of uncertainty in the distance data. We incorporate a statistical model of the uncertainty through corruption or noise in the pairwise distances and investigate the problem of estimating the HC as unknown parameters from measurements. Specifically, we focus on single linkage hierarchical clustering (SLHC) and study its geometry. We prove that under fairly reasonable conditions on the probability distribution governing measurements, SLHC is equivalent to maximum partial profile likelihood estimation (MPPLE) with some of the information contained in the data ignored. At the same time, we show that direct evaluation of SLHC on maximum likelihood estimation (MLE) of pairwise distances yields a consistent estimator. Consequently, a full MLE is expected to perform…
Taxonomy
Topics: Advanced Clustering Algorithms Research · Bayesian Methods and Mixture Models · Data Management and Algorithms
