U-statistical inference for hierarchical clustering
Marcio Valk, Gabriela Bettella Cybis

TL;DR
This paper introduces a U-statistics based method for assessing the significance of hierarchical clustering, especially effective for high-dimensional low-sample-size data, with proven power and broad applicability.
Contribution
It develops a novel U-statistics based approach for significance testing in hierarchical clustering tailored to HDLSS data, including new algorithms and asymptotic theory.
Findings
Methods outperform competing alternatives in simulations
Algorithms are effective in genetics and image recognition applications
Approach relies on minimal assumptions about data
Abstract
Clustering methods are a valuable tool for identifying patterns in high-dimensional data, with applications in many scientific problems. However, quantifying uncertainty in clustering is a challenging problem, particularly when dealing with High Dimension Low Sample Size (HDLSS) data. We develop here a U-statistics-based clustering approach that assesses statistical significance in clustering and is specifically tailored to HDLSS scenarios. These non-parametric methods rely on very few assumptions about the data, and thus can be applied to a wide range of datasets for which the Euclidean distance captures relevant features. We propose two significance clustering algorithms, a hierarchical method and a non-nested version. In order to do so, we first propose an extension of a relevant U-statistic and develop its asymptotic theory. Our methods are tested through extensive…
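To make the idea of a distance-based U-statistic concrete, the following is a minimal sketch and not the authors' algorithm. It assumes a simple contrast between the mean between-group and mean within-group Euclidean distances, and it swaps in a permutation test where the paper develops asymptotic theory. The function names (`mean_within`, `mean_between`, `contrast`, `permutation_pvalue`) and the toy data are hypothetical and chosen only for illustration.

```python
import math
import random
from itertools import combinations


def mean_within(points):
    """U-statistic with kernel d(x, y): mean Euclidean distance over all pairs in one group."""
    pairs = list(combinations(points, 2))
    return sum(math.dist(a, b) for a, b in pairs) / len(pairs)


def mean_between(g1, g2):
    """Two-sample analogue: mean Euclidean distance over all cross-group pairs."""
    return sum(math.dist(a, b) for a in g1 for b in g2) / (len(g1) * len(g2))


def contrast(g1, g2):
    """Assumed illustrative statistic: between-group minus within-group distances.

    Large positive values suggest the two groups are separated.
    """
    return 2 * mean_between(g1, g2) - mean_within(g1) - mean_within(g2)


def permutation_pvalue(g1, g2, n_perm=999, seed=0):
    """Permutation p-value for the contrast (a stand-in for asymptotic theory)."""
    rng = random.Random(seed)
    observed = contrast(g1, g2)
    pooled = list(g1) + list(g2)
    n1 = len(g1)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)  # random reassignment of points to the two groups
        if contrast(pooled[:n1], pooled[n1:]) >= observed:
            hits += 1
    return observed, (hits + 1) / (n_perm + 1)


# Toy HDLSS setting (hypothetical data): 6 samples per group, 200 dimensions,
# second group shifted by 1.0 in every coordinate.
data_rng = random.Random(42)
dim = 200
group_a = [[data_rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(6)]
group_b = [[data_rng.gauss(1.0, 1.0) for _ in range(dim)] for _ in range(6)]

observed, p_value = permutation_pvalue(group_a, group_b)
print(f"contrast = {observed:.3f}, permutation p-value = {p_value:.4f}")
```

Each group needs at least two points so that the within-group average is defined. Only pairwise Euclidean distances enter the computation, which fits the abstract's remark that the methods suit data for which that distance captures relevant features.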
Taxonomy
Topics: Bayesian Methods and Mixture Models · Advanced Clustering Algorithms Research · Complex Network Analysis Techniques
