Is Simple Uniform Sampling Effective for Center-Based Clustering with Outliers: When and Why?
Jiawei Huang, Wenjie Liu, Hu Ding

TL;DR
This paper demonstrates that simple uniform sampling can effectively solve center-based clustering problems with outliers, especially when the data instance is 'significant', offering theoretical insights and practical advantages.
Contribution
It introduces a novel theoretical framework explaining when uniform sampling is effective for clustering with outliers, supported by experimental validation.
Findings
Uniform sampling's effectiveness depends on the 'significance' of the data instance.
Sample size can be independent of data size and dimension under certain conditions.
Uniform sampling outperforms non-uniform methods in practical scenarios.
Abstract
Real-world datasets often contain outliers, and their presence can make clustering problems much more challenging. In this paper, we propose a simple uniform sampling framework for solving three representative center-based clustering with outliers problems: k-center/median/means clustering with outliers. Our analysis is fundamentally different from previous (uniform and non-uniform) sampling-based ideas. To explain the effectiveness of uniform sampling in theory, we introduce a measure of "significance" and prove that the performance of our framework depends on the significance degree of the given instance. In particular, the sample size can be independent of the input data size n and the dimensionality d if we assume the given instance is "significant", which is in fact a fairly reasonable assumption in practice. Due to its simplicity, the uniform…
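The general idea described in the abstract (draw a uniform sample, solve the clustering problem on the sample, reuse the centers for the full data) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' algorithm: the trimmed Lloyd heuristic used as the solver on the sample, the proportional scaling of the outlier budget with a slack factor of 2, and every function and parameter name (`uniform_sample_cluster`, `trimmed_lloyd`, `sample_size`, and so on) are hypothetical choices made for this example. The cost shown is the usual k-means-with-outliers objective: the sum of squared distances to the nearest center after discarding the z farthest points.

```python
import numpy as np


def kmeans_with_outliers_cost(points, centers, z):
    """Sum of squared distances to the nearest center, ignoring the z farthest points."""
    d2 = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2).min(axis=1)
    if z > 0:
        d2 = np.sort(d2)[: len(d2) - z]
    return float(d2.sum())


def trimmed_lloyd(points, k, z, iters=20, rng=None):
    """Stand-in solver (an assumption, not from the paper): Lloyd iterations
    that exclude the z farthest points before each center update."""
    rng = np.random.default_rng(rng)
    centers = points[rng.choice(len(points), size=k, replace=False)].copy()
    for _ in range(iters):
        d2 = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        nearest = d2.argmin(axis=1)
        dist = d2.min(axis=1)
        # Indices of the points kept as inliers in this round.
        keep = np.argsort(dist)[: len(points) - z]
        for j in range(k):
            members = keep[nearest[keep] == j]
            if len(members) > 0:  # leave a center unchanged if its cluster is empty
                centers[j] = points[members].mean(axis=0)
    return centers


def uniform_sample_cluster(points, k, z, sample_size, rng=None):
    """Cluster a uniform sample and return its centers for use on the full data.
    Assumes sample_size >= k."""
    rng = np.random.default_rng(rng)
    n = len(points)
    idx = rng.choice(n, size=min(sample_size, n), replace=False)
    sample = points[idx]
    # Assumption: scale the outlier budget to the sample, with slack factor 2.
    z_sample = int(np.ceil(2 * z * len(sample) / n))
    z_sample = max(0, min(z_sample, len(sample) - k))
    return trimmed_lloyd(sample, k, z_sample, rng=rng)
```

A call such as `uniform_sample_cluster(points, k=3, z=20, sample_size=200)` returns a `(k, d)` array of centers, which can then be scored on the full dataset with `kmeans_with_outliers_cost(points, centers, z)`. Only the sample is passed to the solver, so the clustering step itself does not scale with n; how large `sample_size` must be for a given guarantee is what the paper's significance analysis addresses and is not derived here.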
Taxonomy
Topics: Advanced Statistical Methods and Models · Anomaly Detection Techniques and Applications · Survey Sampling and Estimation Techniques
