Hierarchical Clustering With Confidence
Di Wu, Jacob Bien, Snigdha Panigrahi

TL;DR
This paper introduces a randomized hierarchical clustering method with a statistical testing framework that provides valid p-values for cluster significance, improving stability assessment and cluster number estimation.
Contribution
It proposes a simple randomization scheme and a hypothesis testing method for hierarchical clustering that controls Type I error and enhances cluster validation.
Findings
The method controls false positives in cluster detection.
In simulations, it achieves higher power than existing approaches.
The approach effectively estimates the number of clusters in real data.
Abstract
Agglomerative hierarchical clustering is one of the most widely used approaches for exploring how observations in a dataset relate to each other. However, its greedy nature makes it highly sensitive to small perturbations in the data, often producing different clustering results and making it difficult to separate genuine structure from spurious patterns. In this paper, we show how randomizing hierarchical clustering can be useful not just for measuring stability but also for designing valid hypothesis testing procedures based on the clustering results. We propose a simple randomization scheme together with a method for constructing a valid p-value at each node of the hierarchical clustering dendrogram that quantifies evidence against performing the greedy merge. Our test controls the Type I error rate, works with any hierarchical linkage without case-specific derivations, and…
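The sensitivity to small perturbations described in the abstract can be illustrated with a short Python sketch. This is not the paper's randomization scheme or its p-value construction, which the truncated abstract does not specify. It only adds Gaussian noise to the data, re-runs standard agglomerative clustering, and measures how often the perturbed two-cluster cut agrees with the original. The function names `rand_index` and `perturbation_stability`, the noise level, the average linkage, and the toy data are all assumptions made for illustration.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster


def rand_index(a, b):
    """Fraction of point pairs on which two labelings agree (plain Rand index)."""
    a = np.asarray(a)
    b = np.asarray(b)
    same_a = a[:, None] == a[None, :]
    same_b = b[:, None] == b[None, :]
    upper = np.triu_indices(len(a), k=1)  # each unordered pair once
    return float(np.mean(same_a[upper] == same_b[upper]))


def perturbation_stability(X, k=2, noise_sd=0.5, n_rep=50, method="average", seed=0):
    """Mean Rand index between the k-cluster cut of X and cuts of noisy copies.

    Illustrative stability score only (hypothetical helper, not from the paper).
    """
    rng = np.random.default_rng(seed)
    base = fcluster(linkage(X, method=method), t=k, criterion="maxclust")
    scores = []
    for _ in range(n_rep):
        # Perturb every observation with independent Gaussian noise.
        X_noisy = X + rng.normal(scale=noise_sd, size=X.shape)
        labels = fcluster(linkage(X_noisy, method=method), t=k, criterion="maxclust")
        scores.append(rand_index(base, labels))
    return float(np.mean(scores))


# Toy data (assumed): two well-separated groups versus one structureless cloud.
rng = np.random.default_rng(1)
separated = np.vstack([rng.normal(0, 1, (30, 2)), rng.normal(10, 1, (30, 2))])
null = rng.normal(0, 1, (60, 2))

s_sep = perturbation_stability(separated)
s_null = perturbation_stability(null)
print(f"stability, separated groups: {s_sep:.3f}")
print(f"stability, single cloud:     {s_null:.3f}")
```

A score near 1 means the cut survives perturbation. A stability score of this kind is only descriptive: it carries no Type I error guarantee, which is the gap the paper's testing procedure is meant to address.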
Taxonomy
Topics: Advanced Clustering Algorithms Research · Bayesian Methods and Mixture Models · Complex Network Analysis Techniques
