Clustering is difficult only when it does not matter
Amit Daniely, Nati Linial, Michael Saks

TL;DR
This paper argues that clustering is hard only in the worst case: for data sets that can be clustered well, efficient algorithms are often possible, which challenges the common perception of clustering as inherently hard.
Contribution
The paper introduces a theoretical framework for clustering in metric spaces and shows that good clusterings can often be found efficiently when they exist.
Findings
Good clusterings can be efficiently identified if they exist.
Clustering difficulty is mainly a concern in worst-case scenarios.
Practitioners' optimism is justified for well-clusterable data.
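To make the idea of "easy when well-clusterable" concrete, here is a minimal sketch in Python. It is not the paper's algorithm or framework. It assumes a much stronger, purely illustrative condition: every within-cluster distance is below some threshold and every between-cluster distance is above it. Under that assumption, the connected components of the threshold graph recover the clusters in polynomial time. The function name `threshold_components`, the sample points, and the threshold value are all hypothetical.

```python
from itertools import combinations
from math import dist


def threshold_components(points, threshold):
    """Group point indices into connected components of the graph that
    joins every pair of points at Euclidean distance <= threshold."""
    parent = list(range(len(points)))

    def find(i):
        # Union-find lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Merge the components of every sufficiently close pair.
    for i, j in combinations(range(len(points)), 2):
        if dist(points[i], points[j]) <= threshold:
            parent[find(i)] = find(j)

    # Collect indices by component representative.
    groups = {}
    for i in range(len(points)):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())


# Illustrative data: two tight groups that are far apart (an assumption,
# chosen so that the separation condition above holds with threshold 1.0).
points = [
    (0.0, 0.0), (0.5, 0.0), (0.0, 0.5),
    (10.0, 10.0), (10.5, 10.0), (10.0, 10.5),
]
clusters = threshold_components(points, threshold=1.0)
print(clusters)
```

On data without such a clear separation, this simple procedure can merge or split groups arbitrarily, which is the regime where worst-case hardness results become relevant.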
Abstract
Numerous papers ask how difficult it is to cluster data. We suggest that the more relevant and interesting question is how difficult it is to cluster data sets that can be clustered well. More generally, despite the ubiquity and the great importance of clustering, we still do not have a satisfactory mathematical theory of clustering. In order to properly understand clustering, it is clearly necessary to develop a solid theoretical basis for the area. For example, from the perspective of computational complexity theory the clustering problem seems very hard. Numerous papers introduce various criteria and numerical measures to quantify the quality of a given clustering. The resulting conclusions are pessimistic, since it is computationally difficult to find an optimal clustering of a given data set, if we go by any of these popular criteria. In contrast, the practitioners'…
Taxonomy
Topics: Data Management and Algorithms · Advanced Clustering Algorithms Research · Bayesian Methods and Mixture Models
