Partial k-means to avoid outliers, mathematical programming formulations, complexity results
Nicolas Dupin, Frank Nielsen

TL;DR
This paper introduces a Partial k-means variant to handle outliers, providing mathematical programming formulations, complexity analysis, and efficient solutions for one-dimensional cases.
Contribution
It extends MSSC to discard a fixed number of outliers, gives integer programming formulations, and analyzes complexity, including a polynomial-time dynamic programming algorithm for the unweighted 1D case.
Findings
PMSSC is NP-hard in Euclidean space when the dimension or the number of clusters is greater than 2
Unweighted PMSSC is polynomial in 1D and is solved with a dynamic programming algorithm
Weighted PMSSC satisfies only a weaker optimality property; whether it is NP-hard in 1D remains open
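The unweighted 1D finding can be illustrated with a small dynamic program. This is a minimal sketch, not the authors' algorithm: it assumes the interval property holds in the form that, once the points are sorted, every cluster is a run of consecutive points and the discarded points fall before, between, or after those runs. The function name `partial_kmeans_1d` and the parameters `k` (number of clusters) and `m` (number of outliers) are hypothetical, and the O(k·m·n²) running time is that of this naive sketch only.

```python
from itertools import accumulate


def partial_kmeans_1d(points, k, m):
    """Minimum sum of squared deviations using exactly k non-empty
    interval clusters after discarding exactly m points (unweighted, 1D)."""
    xs = sorted(points)
    n = len(xs)
    if k < 1 or m < 0 or n - m < k:
        raise ValueError("need k >= 1, m >= 0 and at least k kept points")

    # Prefix sums of x and x^2 give the cost of any run in O(1).
    s1 = [0.0] + list(accumulate(xs))
    s2 = [0.0] + list(accumulate(x * x for x in xs))

    def cost(a, b):
        # Sum of squared deviations from the mean for xs[a:b], with b > a.
        total = s1[b] - s1[a]
        return max(0.0, (s2[b] - s2[a]) - total * total / (b - a))

    inf = float("inf")
    # dp[j][i][o]: best cost for the first i sorted points,
    # using j clusters and o discarded points.
    dp = [[[inf] * (m + 1) for _ in range(n + 1)] for _ in range(k + 1)]
    dp[0][0][0] = 0.0
    for j in range(k + 1):
        for i in range(1, n + 1):
            for o in range(m + 1):
                best = inf
                if o > 0:
                    # Point i is discarded as an outlier.
                    best = dp[j][i - 1][o - 1]
                if j > 0:
                    # Point i closes a cluster made of xs[a:i].
                    for a in range(i):
                        prev = dp[j - 1][a][o]
                        if prev < inf:
                            best = min(best, prev + cost(a, i))
                dp[j][i][o] = best
    return dp[k][n][m]
```

For example, `partial_kmeans_1d([0, 1, 5, 10, 11], k=2, m=1)` discards the middle point 5 and keeps the clusters {0, 1} and {10, 11}. Setting `m=0` reduces the sketch to ordinary 1D interval clustering.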
Abstract
A well-known bottleneck of Min-Sum-of-Square Clustering (MSSC, the celebrated k-means problem) is to tackle the presence of outliers. In this paper, we propose a Partial clustering variant termed PMSSC which considers a fixed number of outliers to remove. We solve PMSSC by Integer Programming formulations and complexity results extending the ones from MSSC are studied. PMSSC is NP-hard in Euclidean space when the dimension or the number of clusters is greater than 2. Finally, one-dimensional cases are studied: Unweighted PMSSC is polynomial in that case and solved with a dynamic programming algorithm, extending the optimality property of MSSC with interval clustering. This result holds also for unweighted k-medoids with outliers. A weaker optimality property holds for weighted PMSSC, but NP-hardness or not remains an open question in dimension one.
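As a reading aid, the unweighted objective described in the abstract can be written out as follows. The notation is an assumption made here and is not taken from the paper: points x_1, ..., x_n, a number of clusters k, a number of outliers m, clusters C_j, and centroids c_j.

```latex
\min_{C_1,\dots,C_k}\ \sum_{j=1}^{k}\ \sum_{i \in C_j} \lVert x_i - c_j \rVert^2
\qquad \text{with} \qquad
c_j = \frac{1}{|C_j|} \sum_{i \in C_j} x_i ,
% C_1, ..., C_k are non-empty, pairwise disjoint subsets of {1, ..., n}
% that together cover exactly n - m points; the remaining m points are the outliers.
\qquad \text{subject to} \qquad
C_j \cap C_{j'} = \emptyset \ (j \neq j'), \qquad
\sum_{j=1}^{k} |C_j| = n - m .
```

With m = 0 every point must be assigned and the expression reduces to the usual MSSC (k-means) objective.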
Taxonomy
Topics: Advanced Statistical Methods and Models
