Seeding K-Means using Method of Moments
Sayantan Dasgupta

TL;DR
This paper introduces a seeding method for K-means clustering based on factorizations of higher-order moments. It extracts the initial means in one pass over the data, with a final cost provably within O(√K) of optimal, which makes initialization cheaper on large datasets.
Contribution
The paper presents a new seeding technique based on higher-order moments that requires only one pass over the dataset and offers a provable near-optimal clustering cost.
Findings
Requires only one pass through data for seeding
Guarantees final cost within O(√K) of optimal
Outperforms existing seeding methods on benchmarks
Abstract
K-means is one of the most widely used clustering algorithms in Data Mining applications. It attempts to minimize the sum of squared Euclidean distances from the points in each cluster to that cluster's mean. However, K-means suffers from the local minima problem and is not guaranteed to converge to the optimal cost. K-means++ tries to address this problem by seeding the means using a distance-based sampling scheme. However, seeding the means in K-means++ needs O(K) sequential passes through the entire dataset, which can be very costly for large datasets. Here we propose a method of seeding the initial means based on factorizations of higher-order moments for bounded data. Our method takes O(1) passes through the entire dataset to extract the initial set of means, and its final cost can be proven to be within O(√K) of the…
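For context, the K-means cost and the K-means++ baseline mentioned in the abstract can be sketched as follows. This is a minimal illustration of standard distance-based (D²) seeding, not the moment-based method proposed in the paper; the function names `kmeans_cost` and `kmeanspp_seed` and the pass counter are assumptions made for this example. The counter shows why the baseline needs a number of data sweeps that grows linearly with K.

```python
import numpy as np

def kmeans_cost(X, means):
    """Sum of squared Euclidean distances from each point to its nearest mean."""
    d2 = ((X[:, None, :] - means[None, :, :]) ** 2).sum(axis=2)
    return d2.min(axis=1).sum()

def kmeanspp_seed(X, k, rng):
    """Standard K-means++ (D^2) seeding.

    Returns the k seeds and the number of full sweeps over X that were
    needed to maintain the nearest-seed distances (hypothetical counter,
    added only to illustrate the sequential-pass cost).
    Assumes X contains at least two distinct points.
    """
    n = X.shape[0]
    means = []
    d2 = None       # squared distance of each point to its nearest seed so far
    passes = 0
    for j in range(k):
        if j == 0:
            idx = rng.integers(n)                  # first seed: uniform at random
        else:
            idx = rng.choice(n, p=d2 / d2.sum())   # later seeds: proportional to D^2
        means.append(X[idx])
        if j < k - 1:
            # One full sweep over the data to refresh nearest-seed distances.
            new = ((X - X[idx]) ** 2).sum(axis=1)
            d2 = new if d2 is None else np.minimum(d2, new)
            passes += 1
    return np.array(means), passes
```

With `rng = np.random.default_rng(0)` and a data matrix `X` of shape `(n, d)`, calling `kmeanspp_seed(X, k, rng)` performs `k - 1` sweeps over the data, each of which depends on the seed chosen in the previous step, so the sweeps cannot be merged into one.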
Taxonomy
Topics: Face and Expression Recognition · Advanced Clustering Algorithms Research · Data Management and Algorithms
