An Initial Seed Selection Algorithm for K-means Clustering of Georeferenced Data to Improve Replicability of Cluster Assignments for Mapping Application
Fouad Khan

TL;DR
This paper introduces a simple seed selection method for k-means clustering of georeferenced data that significantly enhances replicability and efficiency without adding parameters, outperforming existing methods.
Contribution
A novel seed selection algorithm based on dataset gaps improves k-means replicability and efficiency, with no extra parameters needed.
Findings
Replicability of cluster assignments improved by up to 100%.
Variance over different runs reduced to nearly zero.
Method is more computationally efficient than k-means++.
Abstract
K-means is one of the most widely used clustering algorithms in various disciplines, especially for large datasets. However the method is known to be highly sensitive to initial seed selection of cluster centers. K-means++ has been proposed to overcome this problem and has been shown to have better accuracy and computational efficiency than k-means. In many clustering problems though -such as when classifying georeferenced data for mapping applications- standardization of clustering methodology, specifically, the ability to arrive at the same cluster assignment for every run of the method i.e. replicability of the methodology, may be of greater significance than any perceived measure of accuracy, especially when the solution is known to be non-unique, as in the case of k-means clustering. Here we propose a simple initial seed selection algorithm for k-means clustering along one…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
Methodsk-Means Clustering
