Distribution free optimality intervals for clustering
Marina Meilă, Hanyu Zhang

TL;DR
This paper introduces a distribution-free method to validate clustering results by providing guarantees on their optimality and stability, applicable to various loss functions without relying on distributional assumptions.
Contribution
It presents a generic convex optimization-based approach to obtain post-inference guarantees for clustering quality and stability across different criteria.
Findings
The method yields guarantees for K-means and Normalized Cut clusterings on real datasets.
Asymptotic instability implies finite sample instability with high probability.
The method does not depend on distributional assumptions, only on the stability of the data.
Abstract
We address the problem of validating the output of clustering algorithms. Given data and a partition of these data into clusters, when can we say that the clusters obtained are correct or meaningful for the data? This paper introduces a paradigm in which a clustering is considered meaningful if it is good with respect to a loss function such as the K-means distortion, and stable, i.e. the only good clustering up to small perturbations. Furthermore, we present a generic method to obtain post-inference guarantees of near-optimality and stability for a clustering. The method can be instantiated for a variety of clustering criteria (also called loss functions) for which convex relaxations exist. Obtaining the guarantees amounts to solving a convex optimization problem. We demonstrate the practical relevance of this method by…
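To make the two ingredients of the abstract concrete, the following is a minimal sketch, not the authors' method: it computes the K-means distortion of a given partition (the "good" part) and a distance between two partitions (the quantity in which "stable" would be measured). The function names `kmeans_distortion` and `misclassification_distance`, the toy data, and the choice of misclassification error as the distance are assumptions made for illustration; the convex relaxation that produces the actual guarantees is not shown.

```python
from itertools import permutations

def kmeans_distortion(points, labels, k):
    """Sum of squared distances from each point to its cluster centroid."""
    dim = len(points[0])
    total = 0.0
    for c in range(k):
        members = [p for p, l in zip(points, labels) if l == c]
        if not members:
            continue  # an empty cluster contributes nothing
        centroid = [sum(p[d] for p in members) / len(members) for d in range(dim)]
        total += sum((p[d] - centroid[d]) ** 2 for p in members for d in range(dim))
    return total

def misclassification_distance(labels_a, labels_b, k):
    """Fraction of points labeled differently, minimized over relabelings of b.

    Brute force over all k! label permutations, so only suitable for small k.
    """
    n = len(labels_a)
    best = n
    for perm in permutations(range(k)):
        mismatches = sum(1 for a, b in zip(labels_a, labels_b) if a != perm[b])
        best = min(best, mismatches)
    return best / n

# Hypothetical toy data: two well-separated pairs of points.
points = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (5.0, 6.0)]
good = [0, 0, 1, 1]       # groups nearby points together
relabeled = [1, 1, 0, 0]  # same partition, cluster names swapped
bad = [0, 1, 0, 1]        # mixes the two groups

print(kmeans_distortion(points, good, 2))
print(kmeans_distortion(points, bad, 2))
print(misclassification_distance(good, relabeled, 2))
print(misclassification_distance(good, bad, 2))
```

In this toy case the partition with low distortion is far, in misclassification distance, from the partition with high distortion, while a mere renaming of clusters is at distance zero. A stability guarantee of the kind the abstract describes would bound how far any other low-loss clustering can be from the given one.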
Taxonomy
Topics: Sparse and Compressive Sensing Techniques · Statistical Methods and Inference · Bayesian Methods and Mixture Models
