Scalable Initialization Methods for Large-Scale Clustering
Joonas Hämäläinen, Tommi Kärkkäinen, Tuomo Rossi

TL;DR
This paper introduces scalable, parallelizable initialization methods for large-scale K-means clustering, leveraging divide-and-conquer and random projections, and demonstrates their superior performance over existing methods on synthetic and real datasets.
Contribution
The paper proposes two novel scalable initialization methods for K-means that utilize divide-and-conquer and multiple subspaces, improving performance on large-scale clustering tasks.
Findings
Proposed methods outperform K-means++ and K-means|| in large-scale experiments.
K-means++ behaves similarly to random initialization in high-dimensional data.
A new high-dimensional clustering data generation algorithm is introduced for benchmarking.
Abstract
In this work, two new initialization methods for K-means clustering are proposed. Both proposals apply a divide-and-conquer approach to a K-means||-type initialization strategy. The second proposal also utilizes multiple lower-dimensional subspaces produced by the random projection method for the initialization. The proposed methods are scalable and can be run in parallel, which makes them suitable for initializing large-scale problems. In the experiments, the proposed methods are compared to the K-means++ and K-means|| methods using an extensive set of reference and synthetic large-scale datasets. For the synthetic datasets, a novel high-dimensional clustering data generation algorithm is given. The experiments show that the proposed methods compare favorably to the state-of-the-art. We also observe that the currently most popular K-means++…
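To make the divide-and-conquer idea concrete, here is a minimal sketch in plain Python. It is not the authors' algorithm: it assumes that each data partition is seeded independently with standard K-means++ (D² sampling) and that the pooled candidates are then reduced to K seeds by a weighted K-means++ pass, which is one common way to combine per-partition results. The function names (`kmeanspp_seed`, `divide_and_conquer_seed`, `random_projection`) and the `n_parts` parameter are hypothetical. The Gaussian random projection helper only shows how a lower-dimensional subspace can be produced; how the paper combines multiple subspaces is not reproduced here.

```python
import math
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeanspp_seed(points, k, rng, weights=None):
    """Weighted K-means++ (D^2) seeding; returns k seeds drawn from `points`."""
    if weights is None:
        weights = [1.0] * len(points)
    indices = range(len(points))
    first = rng.choices(indices, weights=weights, k=1)[0]
    centers = [points[first]]
    d2 = [sq_dist(p, centers[0]) for p in points]
    while len(centers) < k:
        probs = [w * d for w, d in zip(weights, d2)]
        if sum(probs) == 0:  # every point coincides with a chosen seed
            probs = weights
        idx = rng.choices(indices, weights=probs, k=1)[0]
        centers.append(points[idx])
        d2 = [min(d, sq_dist(p, points[idx])) for p, d in zip(points, d2)]
    return centers


def divide_and_conquer_seed(points, k, n_parts=4, seed=0):
    """Assumed scheme: seed each partition, pool candidates, reduce to k."""
    rng = random.Random(seed)
    shuffled = points[:]
    rng.shuffle(shuffled)
    parts = [shuffled[i::n_parts] for i in range(n_parts)]
    candidates = []
    for part in parts:  # partitions are independent, so this loop could run in parallel
        if part:
            candidates.extend(kmeanspp_seed(part, min(k, len(part)), rng))
    # Weight each candidate by the number of points closest to it.
    weights = [0.0] * len(candidates)
    for p in points:
        nearest = min(range(len(candidates)),
                      key=lambda j: sq_dist(p, candidates[j]))
        weights[nearest] += 1.0
    return kmeanspp_seed(candidates, k, rng, weights)


def random_projection(points, target_dim, rng):
    """Project points to target_dim dimensions with a Gaussian random matrix."""
    dim = len(points[0])
    scale = 1.0 / math.sqrt(target_dim)
    matrix = [[rng.gauss(0.0, 1.0) * scale for _ in range(target_dim)]
              for _ in range(dim)]
    return [[sum(p[i] * matrix[i][j] for i in range(dim))
             for j in range(target_dim)] for p in points]
```

Calling `divide_and_conquer_seed(points, k)` on a list of equal-length numeric lists returns `k` seeds taken from the data, which could then be passed to any K-means implementation as initial centroids. The pure-Python loops are meant for readability only; at the scale the paper targets, a vectorized or distributed implementation would be needed.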
Taxonomy
Topics: Advanced Clustering Algorithms Research · Face and Expression Recognition · Data Management and Algorithms
Methods: k-Means Clustering
