Improved seeding strategies for k-means and k-GMM
Guillaume Carrière, Frédéric Cazals

TL;DR
This paper introduces improved randomized seeding strategies for k-means and k-GMM clustering, demonstrating consistent gains in the final clustering metric and providing new insights into how the initial seeding relates to the final result.
Contribution
It formalizes key aspects of seeding methods, proposes novel lookahead and multipass strategies, and shows their effectiveness over classical approaches.
Findings
Consistent improvement over classical seeding methods in final clustering metrics.
Insights into the relationship between initial seeding and final SSE.
Reduction in variance and sensitivity in iterative seeding methods.
Abstract
We revisit the randomized seeding techniques for k-means clustering and k-GMM (Gaussian Mixture Model fitting with Expectation-Maximization), formalizing their three key ingredients: the metric used for seed sampling, the number of candidate seeds, and the metric used for seed selection. This analysis yields novel families of initialization methods exploiting a lookahead principle (conditioning seed selection on an enhanced coherence with the final metric used to assess the algorithm) and a multipass strategy to tame the effect of randomization. Experiments show a consistent constant-factor improvement over classical contenders in terms of the final metric (SSE for k-means, log-likelihood for k-GMM), at a modest overhead. In particular, for k-means, our methods improve on the recently designed multi-swap strategy, which was the first one to outperform the greedy k-means++…
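To make the three ingredients named in the abstract concrete, here is a minimal sketch of a classical greedy k-means++ style seeding in plain Python. It is not the paper's lookahead or multipass method. It only illustrates, under stated assumptions, where the sampling metric (squared distance to the nearest seed), the number of candidates, and the selection metric (SSE with respect to the current seeds) enter. The names greedy_seeding, sq_dist, and n_candidates are hypothetical and chosen for illustration.

```python
import random


def sq_dist(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def greedy_seeding(points, k, n_candidates=4, seed=None):
    """Illustrative greedy k-means++ style seeding (an assumption-laden sketch).

    Returns the k chosen seeds and the SSE of the data w.r.t. those seeds.
    """
    rng = random.Random(seed)
    n = len(points)
    # First seed: drawn uniformly at random.
    centers = [points[rng.randrange(n)]]
    # d2[i] = squared distance from point i to its nearest seed so far.
    d2 = [sq_dist(p, centers[0]) for p in points]
    for _ in range(1, k):
        # Ingredient 1 (sampling metric): probability proportional to d2.
        # Fall back to uniform weights if every point coincides with a seed.
        weights = d2 if sum(d2) > 0 else [1.0] * n
        # Ingredient 2 (number of candidates): draw n_candidates indices.
        candidates = rng.choices(range(n), weights=weights, k=n_candidates)
        # Ingredient 3 (selection metric): keep the candidate whose addition
        # gives the smallest SSE w.r.t. the seeds chosen so far.
        best_d2, best_sse, best_idx = None, float("inf"), None
        for c in candidates:
            new_d2 = [min(old, sq_dist(p, points[c]))
                      for old, p in zip(d2, points)]
            sse = sum(new_d2)
            if sse < best_sse:
                best_d2, best_sse, best_idx = new_d2, sse, c
        centers.append(points[best_idx])
        d2 = best_d2
    return centers, sum(d2)


if __name__ == "__main__":
    data_rng = random.Random(0)
    # Three synthetic, well-separated 2D blobs (made-up data for the demo).
    data = [(cx + data_rng.gauss(0, 0.1), cy + data_rng.gauss(0, 0.1))
            for cx, cy in [(0, 0), (10, 0), (0, 10)] for _ in range(20)]
    seeds, sse = greedy_seeding(data, k=3, n_candidates=4, seed=1)
    print(seeds, sse)
```

In this sketch the selection step scores candidates by the SSE of the seeds alone, before any Lloyd iterations are run. The abstract suggests the paper's lookahead principle instead ties selection more closely to the final metric; that variant is not reproduced here.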
Taxonomy
Topics: Advanced Algorithms and Applications · Advanced Measurement and Detection Methods
Methods: k-Means Clustering
