A Data-Driven Approach to Estimating the Number of Clusters in Hierarchical Clustering

Antoine Zambelli

arXiv:1608.04700·q-bio.QM·August 17, 2016

TL;DR

This paper introduces two data-driven, fully automated methods for estimating the number of clusters in hierarchical clustering. On simulated data and the Biobase Gene Expression Set, they outperform the Gap statistic and Elbow methods in multi-cluster scenarios.

Contribution

The paper presents two new methods for estimating the number of clusters in hierarchical clustering. Both are easy to implement, computationally inexpensive, and require no input from the researcher.

Findings

01 Outperform the Gap statistic and Elbow methods in multi-cluster scenarios

02 Effective on simulated datasets and gene expression data

03 Fully automated with no human intervention

Abstract

We propose two new methods for estimating the number of clusters in a hierarchical clustering framework, with the aim of creating a fully automated process. The methods are completely data-driven and require no input from the researcher. They are easy to implement and computationally inexpensive. We analyze performance on several simulated data sets and the Biobase Gene Expression Set, comparing our methods to the established Gap statistic and Elbow methods; ours outperform both in multi-cluster scenarios.
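For readers unfamiliar with the Elbow baseline named in the abstract, the following is a minimal sketch of one common elbow-style heuristic applied to hierarchical clustering. It is not either of the paper's two proposed methods, which this summary does not describe. The choice of average linkage, the second-difference ("acceleration") criterion, and the names `merge_heights` and `elbow_k` are all assumptions made for illustration.

```python
import math
import random


def merge_heights(points):
    """Naive average-linkage agglomerative clustering (illustrative, O(n^3)).

    Returns the distance at which each successive merge happens.
    """
    clusters = [[p] for p in points]
    heights = []
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Average pairwise distance between the two clusters.
                total = sum(math.dist(a, b) for a in clusters[i] for b in clusters[j])
                d = total / (len(clusters[i]) * len(clusters[j]))
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
        heights.append(d)
    return heights


def elbow_k(heights, max_k=10):
    """Pick k at the sharpest bend in the last merge heights (assumed criterion)."""
    # tail[i] is the height of the merge that reduces (i + 2) clusters to (i + 1).
    tail = heights[::-1][:max_k]
    if len(tail) < 3:
        raise ValueError("need at least three merges (four points)")
    # Discrete second difference; it peaks where large merges give way to small ones.
    accel = [tail[i] - 2 * tail[i + 1] + tail[i + 2] for i in range(len(tail) - 2)]
    return accel.index(max(accel)) + 2


# Hypothetical toy data: three tight, well-separated 2-D blobs.
rng = random.Random(0)
centers = [(0.0, 0.0), (10.0, 0.0), (5.0, 8.66)]
points = [
    (cx + rng.gauss(0, 0.3), cy + rng.gauss(0, 0.3))
    for cx, cy in centers
    for _ in range(15)
]
k = elbow_k(merge_heights(points))
print(k)
```

A heuristic of this form can never return a single cluster, since the smallest value it can produce is 2, and `max_k` is still a value the user has to choose.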
