Recovering the number of clusters in data sets with noise features using feature rescaling factors

Renato Cordeiro de Amorim; Christian Hennig

arXiv:1602.06989 · stat.ML · February 24, 2016

TL;DR

This paper proposes three feature re-scaling methods that enhance clustering validity indexes' ability to accurately identify the true number of spherical Gaussian clusters in data sets with noise features.

Contribution

It introduces novel re-scaling techniques that adapt to data structure and feature relevance, improving cluster number estimation accuracy.

Findings

01

The re-scaling methods increase the likelihood that a validity index returns the true number of clusters.

02

The improvement was observed across multiple validity indexes (Silhouette, Dunn's, Calinski-Harabasz and Hartigan).

03

The methods are effective in the presence of noise features.

Abstract

In this paper we introduce three methods for re-scaling data sets aiming at improving the likelihood of clustering validity indexes to return the true number of spherical Gaussian clusters with additional noise features. Our method obtains feature re-scaling factors taking into account the structure of a given data set and the intuitive idea that different features may have different degrees of relevance at different clusters. We experiment with the Silhouette (using squared Euclidean, Manhattan, and the $p^{th}$ power of the Minkowski distance), Dunn's, Calinski-Harabasz and Hartigan indexes on data sets with spherical Gaussian clusters with and without noise features. We conclude that our methods indeed increase the chances of estimating the true number of clusters in a data set.
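
To make the idea in the abstract concrete, the sketch below picks the number of clusters by maximising the Calinski-Harabasz index over candidate values of k, once on raw data containing a noise feature and once after feature re-scaling. It is a minimal illustration under stated assumptions, not the authors' procedure: the toy data, the deterministic farthest-first k-means, the helper names (`kmeans`, `calinski_harabasz`, `estimate_k`, `rescale`) and, in particular, the hand-picked `factors` are all hypothetical. The paper derives its re-scaling factors from the data; this summary does not give those formulas, so they are not reproduced here.

```python
from math import dist  # Euclidean distance, Python 3.8+


def kmeans(points, k, iters=50):
    """Lloyd's k-means with a deterministic farthest-first initialisation."""
    centers = [points[0]]
    while len(centers) < k:
        # Next centre: the point farthest from its nearest chosen centre.
        centers.append(max(points, key=lambda p: min(dist(p, c) for c in centers)))
    labels = None
    for _ in range(iters):
        new_labels = [min(range(k), key=lambda j: dist(p, centers[j])) for p in points]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        for j in range(k):
            members = [p for p, l in zip(points, labels) if l == j]
            if members:
                centers[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return labels


def calinski_harabasz(points, labels):
    """Between-cluster over within-cluster dispersion, each divided by its degrees of freedom."""
    n, k = len(points), len(set(labels))
    grand = tuple(sum(c) / n for c in zip(*points))
    within = between = 0.0
    for j in set(labels):
        members = [p for p, l in zip(points, labels) if l == j]
        centre = tuple(sum(c) / len(members) for c in zip(*members))
        within += sum(dist(p, centre) ** 2 for p in members)
        between += len(members) * dist(centre, grand) ** 2
    return (between / (k - 1)) / (within / (n - k))


def estimate_k(points, k_range):
    """Return the candidate k whose k-means partition has the highest index value."""
    return max(k_range, key=lambda k: calinski_harabasz(points, kmeans(points, k)))


def rescale(points, factors):
    """Multiply every feature by its re-scaling factor."""
    return [tuple(x * f for x, f in zip(p, factors)) for p in points]


# Toy data (hypothetical): feature 1 carries three tight clusters,
# feature 2 is noise unrelated to the cluster structure.
informative = [0.0, 0.5, 1.0, 10.0, 10.5, 11.0, 20.0, 20.5, 21.0]
noise = [12.0, -12.0, 11.0, -11.0, 12.0, -12.0, 11.0, -11.0, 12.0]
noisy = list(zip(informative, noise))

# Hand-picked factors standing in for data-driven ones (an assumption):
# keep the informative feature, shrink the noise feature.
factors = (1.0, 0.01)

k_raw = estimate_k(noisy, range(2, 6))                         # can be misled by the noise feature
k_rescaled = estimate_k(rescale(noisy, factors), range(2, 6))  # 3, the number of planted clusters
```

The design point is that the validity index itself is unchanged; only the feature scales differ. Once the noise feature contributes little to the distances, the within-cluster dispersion of the three planted clusters is small again and the index peaks at k = 3.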
