Too Much Information Kills Information: A Clustering Perspective
Yicheng Xu, Vincent Chau, Chenchen Wu, Yong Zhang, Vassilis Zissimopoulos, Yifei Zou

TL;DR
This paper introduces a novel variance-based clustering approach that uses only a small subset of data, achieving comparable or better results than traditional methods with significantly less information.
Contribution
The paper proposes a sampling-based clustering method that efficiently estimates optimal clusters using minimal dataset information, with theoretical guarantees and practical improvements.
Findings
Achieves clustering quality competitive with k-means and k-means++ using only 7% of the dataset's information.
In most cases, outperforms k-means and k-means++ when given up to 15% of the dataset's information.
Provides a balanced clustering extension based on the same principle.
Abstract
Clustering is one of the most fundamental tools in the artificial intelligence area, particularly in pattern recognition and learning theory. In this paper, we propose a simple but novel approach for variance-based k-clustering tasks, which include the widely known k-means clustering. The proposed approach picks a sampling subset from the given dataset and makes decisions based on the data information in the subset only. Under certain assumptions, the resulting clustering provably gives a good estimate of the optimum of the variance-based objective with high probability. Extensive experiments on synthetic and real-world datasets show that, to obtain results competitive with the k-means method (Lloyd 1982) and the k-means++ method (Arthur and Vassilvitskii 2007), we need only 7% of the dataset's information. If we have up to 15% information of the dataset, then our algorithm…
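The sampling idea in the abstract can be illustrated with a short sketch. This is not the authors' algorithm, whose details are not given on this page. It assumes a uniform random sample and substitutes plain Lloyd iterations on that sample, then scores the resulting centers on the full dataset with the k-means (sum of squared distances) objective. All function names, the default fraction of 0.07 (borrowed from the findings above purely for illustration), and the iteration count are hypothetical choices.

```python
import random


def sq_dist(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def mean(points):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(coord) / n for coord in zip(*points))


def nearest(p, centers):
    """Index of the center closest to p."""
    return min(range(len(centers)), key=lambda i: sq_dist(p, centers[i]))


def lloyd(points, k, iters, rng):
    """Plain Lloyd iterations (assumed stand-in, not the paper's method)."""
    centers = rng.sample(points, k)
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for p in points:
            groups[nearest(p, centers)].append(p)
        # An empty cluster keeps its previous center.
        centers = [mean(g) if g else c for g, c in zip(groups, centers)]
    return centers


def cluster_from_sample(data, k, fraction=0.07, iters=20, seed=0):
    """Fit centers on a uniform sample only, then label the full dataset."""
    rng = random.Random(seed)
    m = min(len(data), max(k, round(fraction * len(data))))
    sample = rng.sample(data, m)           # the only data used for fitting
    centers = lloyd(sample, k, iters, rng)
    labels = [nearest(p, centers) for p in data]
    # Variance-based (k-means) objective evaluated on all points.
    cost = sum(sq_dist(p, centers[j]) for p, j in zip(data, labels))
    return centers, labels, cost
```

Calling `cluster_from_sample(data, k=2)` on a list of coordinate tuples returns the fitted centers, a label per point, and the objective value. Comparing that cost with the cost of running `lloyd` on the full list gives a rough sense of how much quality is lost by fitting on the sample alone.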
Taxonomy
Topics: Advanced Clustering Algorithms Research · Data Management and Algorithms · Face and Expression Recognition
