On Simplifying Large-Scale Spatial Vectors: Fast, Memory-Efficient, and Cost-Predictable k-means
Yushuai Ji, Zepeng Liu, Sheng Wang, Yuan Sun, Zhiyong Peng

TL;DR
This paper introduces Dask-means, a novel k-means algorithm that is fast, memory-efficient, and cost-predictable, suitable for large-scale spatial data and resource-constrained devices.
Contribution
We propose Dask-means, which accelerates k-means with a memory-tunable index and a lightweight cost estimator for resource-aware execution.
Findings
Uses less than 30 MB of memory on large datasets
Achieves a speedup of more than 168x over Lloyd's algorithm
Demonstrates significant speedup and low memory usage on mobile devices
Abstract
The k-means algorithm can simplify large-scale spatial vectors, such as 2D geo-locations and 3D point clouds, to support fast analytics and learning. However, existing k-means algorithms for large-scale datasets achieve high performance only by consuming significant computational resources, such as memory and CPU time. These algorithms, though effective, are not well suited to resource-constrained devices. In this paper, we propose a fast, memory-efficient, and cost-predictable k-means algorithm called Dask-means. We first accelerate k-means by designing a memory-efficient accelerator, which uses an optimized nearest neighbor search over a memory-tunable index to assign spatial vectors to clusters in batches. We then design a lightweight cost estimator to predict the memory cost and runtime of the k-means task, allowing it to request appropriate memory from…
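For orientation, the following is a minimal sketch of the Lloyd's algorithm baseline named in the findings, with the assignment step done in batches. It is not the Dask-means implementation: the nearest-centroid search here is brute force, whereas the paper describes an optimized search over a memory-tunable index that is not reproduced here. The names `lloyd_kmeans`, `assign_batch`, and the `batch_size` parameter are hypothetical and chosen only for illustration.

```python
import random
from typing import List, Sequence, Tuple

Point = Tuple[float, ...]


def squared_distance(a: Point, b: Point) -> float:
    """Squared Euclidean distance between two points of equal dimension."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def assign_batch(batch: Sequence[Point], centroids: Sequence[Point]) -> List[int]:
    """Brute-force nearest-centroid lookup for one batch of points.

    Assumption: an indexed nearest neighbor search would replace this step.
    """
    return [
        min(range(len(centroids)), key=lambda j: squared_distance(p, centroids[j]))
        for p in batch
    ]


def lloyd_kmeans(
    points: Sequence[Point],
    k: int,
    batch_size: int = 1024,
    max_iters: int = 100,
    seed: int = 0,
) -> Tuple[List[Point], List[int]]:
    """Plain Lloyd's k-means; points are assigned batch by batch."""
    rng = random.Random(seed)
    centroids: List[Point] = rng.sample(list(points), k)
    labels = [-1] * len(points)
    dim = len(centroids[0])

    for _ in range(max_iters):
        changed = False
        sums = [[0.0] * dim for _ in range(k)]  # per-cluster coordinate sums
        counts = [0] * k                        # per-cluster point counts

        # Assignment step: only one batch of labels is materialized at a time.
        for start in range(0, len(points), batch_size):
            batch = points[start:start + batch_size]
            for offset, label in enumerate(assign_batch(batch, centroids)):
                i = start + offset
                if labels[i] != label:
                    labels[i] = label
                    changed = True
                counts[label] += 1
                for d in range(dim):
                    sums[label][d] += batch[offset][d]

        if not changed:
            break  # labels are stable, so centroids are already the cluster means

        # Update step: move each centroid to the mean of its assigned points;
        # an empty cluster keeps its previous centroid.
        centroids = [
            tuple(s / counts[j] for s in sums[j]) if counts[j] else centroids[j]
            for j in range(k)
        ]

    return centroids, labels
```

Calling `lloyd_kmeans(points, k=2, batch_size=2)` on a small list of 2D tuples returns the final centroids and one cluster label per point. A smaller `batch_size` lowers the peak size of the temporary label list at the cost of more loop overhead, which only loosely illustrates the kind of memory and runtime trade-off the abstract refers to.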
Taxonomy
Topics: Data Management and Algorithms · Data Mining Algorithms and Applications · Geographic Information Systems Studies
