Clustering is difficult only when it does not matter

Amit Daniely; Nati Linial; Michael Saks

arXiv:1205.4891 · cs.LG · May 23, 2012 · 21 cites

Open Access

TL;DR

This paper argues that clustering is hard only in the worst case: for data sets that can be clustered well, efficient algorithms often exist, which challenges the common perception of clustering as inherently hard.

Contribution

The paper introduces a theoretical framework for clustering in metric spaces and shows that good clusterings can often be found efficiently when they exist.

Findings

1. Good clusterings can be efficiently identified if they exist.

2. Clustering difficulty is mainly a concern in worst-case scenarios.

3. Practitioners' optimism is justified for well-clusterable data.

Abstract

Numerous papers ask how difficult it is to cluster data. We suggest that the more relevant and interesting question is how difficult it is to cluster data sets that can be clustered well. More generally, despite the ubiquity and the great importance of clustering, we still do not have a satisfactory mathematical theory of clustering. In order to properly understand clustering, it is clearly necessary to develop a solid theoretical basis for the area. For example, from the perspective of computational complexity theory the clustering problem seems very hard. Numerous papers introduce various criteria and numerical measures to quantify the quality of a given clustering. The resulting conclusions are pessimistic, since it is computationally difficult to find an optimal clustering of a given data set, if we go by any of these popular criteria. In contrast, the practitioners'…

Taxonomy

Topics: Data Management and Algorithms · Advanced Clustering Algorithms Research · Bayesian Methods and Mixture Models