How many clusters? An information theoretic perspective

Susanne Still; William Bialek

arXiv:physics/0303011 · physics.data-an · May 23, 2007 · 3 cites

Open Access

TL;DR

This paper introduces an information-theoretic approach to choosing the number of clusters in a data set. A temperature parameter sets the trade-off between descriptive detail and noise, so no external goodness criterion is needed.

Contribution

It proposes a method, framed in statistical mechanics terms, that identifies the maximum number of meaningful clusters a data set can support, given its size and the resulting sampling bias.

Findings

01. Finite data sets impose limits on resolvable structure.

02. The optimal clustering temperature depends on the size of the data set.

03. The method finds the maximum number of clusters that the data can meaningfully support.

Abstract

Clustering provides a common means of identifying structure in complex data, and there is renewed interest in clustering as a tool for the analysis of large data sets in many fields. A natural question is how many clusters are appropriate for the description of a given system. Traditional approaches to this problem are based either on a framework in which clusters of a particular shape are assumed as a model of the system or on a two-step procedure in which a clustering criterion determines the optimal assignments for a given number of clusters and a separate criterion measures the goodness of the classification to determine the number of clusters. In a statistical mechanics approach, clustering can be seen as a trade-off between energy- and entropy-like terms, with lower temperature driving the proliferation of clusters to provide a more detailed description of the data. For finite…
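The temperature trade-off described in the abstract can be illustrated with a small sketch. This is not the authors' algorithm: it swaps in plain soft K-means on 1-D data with a squared-distance "energy", and it omits any finite-sample correction. The function names `soft_cluster` and `count_distinct`, the toy data, and the two temperatures are all assumptions made for illustration.

```python
import numpy as np

def soft_cluster(x, n_centers, temperature, n_iter=200):
    """Soft K-means on 1-D data at a fixed temperature (illustrative only)."""
    # Start all centers close to the data mean, with small distinct offsets.
    offsets = 0.01 * np.linspace(-1.0, 1.0, n_centers)
    centers = x.mean() + offsets
    for _ in range(n_iter):
        # "Energy" of assigning point i to center c: squared distance.
        energy = (x[:, None] - centers[None, :]) ** 2
        # Boltzmann assignments p(c|x); subtract the row minimum for stability.
        logits = -(energy - energy.min(axis=1, keepdims=True)) / temperature
        p = np.exp(logits)
        p /= p.sum(axis=1, keepdims=True)
        # Move each center to the weighted mean of the points it claims.
        centers = (p * x[:, None]).sum(axis=0) / p.sum(axis=0)
    return centers

def count_distinct(centers, tol=1e-3):
    """Number of centers that differ by more than tol after sorting."""
    return 1 + int(np.sum(np.diff(np.sort(centers)) > tol))

# Hypothetical toy data: two tight groups around -1 and +1.
x = np.array([-1.1, -1.0, -0.9, 0.9, 1.0, 1.1])
for temperature in (10.0, 0.1):
    centers = soft_cluster(x, n_centers=4, temperature=temperature)
    print(temperature, count_distinct(centers))
```

On this toy data, all four centers should coincide at the high temperature and separate into two groups at the lower one, even though four centers are available. The number of distinct centers is therefore set by the temperature rather than by the number of centers supplied, which is the sense in which lowering the temperature drives the proliferation of clusters.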

Taxonomy

Topics: Bayesian Methods and Mixture Models · Advanced Clustering Algorithms Research · Time Series Analysis and Forecasting