Clustering is Easy When ....What?
Shai Ben-David

TL;DR
This paper reviews the theoretical landscape of clustering, emphasizing that practical clustering is often feasible despite NP-hardness, especially on data that is realistically clusterable, and discusses the gap between theory and practice.
Contribution
It provides a critical overview of existing results supporting the idea that clustering is computationally easier on practically relevant data, and highlights the research challenges in formalizing that idea.
Findings
Most common clustering objectives are NP-hard to optimize in the worst case.
Practical clustering often succeeds despite theoretical hardness.
A gap remains between the theoretical results available so far and those needed to explain practical clustering success.
Abstract
It is well known that most of the common clustering objectives are NP-hard to optimize. In practice, however, clustering is being routinely carried out. One approach for providing theoretical understanding of this seeming discrepancy is to come up with notions of clusterability that distinguish realistically interesting input data from worst-case data sets. The hope is that there will be clustering algorithms that are provably efficient on such "clusterable" instances. This paper addresses the thesis that the computational hardness of clustering tasks goes away for inputs that one really cares about. In other words, that "Clustering is difficult only when it does not matter" (the CDNM thesis for short). I wish to present a critical bird's-eye overview of the results published on this issue so far and to call attention to the gap between available and desirable results on this…
Taxonomy
Topics: Data Management and Algorithms · Advanced Database Systems and Queries · Algorithms and Data Compression
