On Simplifying Large-Scale Spatial Vectors: Fast, Memory-Efficient, and Cost-Predictable k-means

Yushuai Ji; Zepeng Liu; Sheng Wang; Yuan Sun; Zhiyong Peng

arXiv:2412.02244·cs.LG·December 4, 2024

Open Access · 1 Repo

TL;DR

This paper introduces Dask-means, a novel k-means algorithm that is fast, memory-efficient, and cost-predictable, suitable for large-scale spatial data and resource-constrained devices.

Contribution

We propose Dask-means, which accelerates k-means with a memory-tunable index and a lightweight cost estimator for resource-aware execution.

Findings

01 Uses less than 30MB memory on large datasets

02 Achieves over 168x speedup over Lloyd's algorithm

03 Demonstrates significant speedup and low memory on mobile devices

Abstract

The k-means algorithm can simplify large-scale spatial vectors, such as 2D geo-locations and 3D point clouds, to support fast analytics and learning. However, when processing large-scale datasets, existing k-means algorithms have been developed to achieve high performance with significant computational resources, such as memory and CPU usage time. These algorithms, though effective, are not well-suited for resource-constrained devices. In this paper, we propose a fast, memory-efficient, and cost-predictable k-means called Dask-means. We first accelerate k-means by designing a memory-efficient accelerator, which utilizes an optimized nearest neighbor search over a memory-tunable index to assign spatial vectors to clusters in batches. We then design a lightweight cost estimator to predict the memory cost and runtime of the k-means task, allowing it to request appropriate memory from…
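For orientation, the batched assignment step mentioned in the abstract can be illustrated with a minimal sketch of plain Lloyd's k-means, the baseline the findings compare against. This is not Dask-means itself: the nearest-centroid search here is brute force rather than a search over a memory-tunable index, and the names `assign_batch`, `lloyd_kmeans`, `batch_size`, and `init` are hypothetical, chosen only for this illustration.

```python
import random

def assign_batch(batch, centroids):
    """Brute-force nearest-centroid lookup for one batch of points.
    (Assumption: stands in for the paper's index-based search.)"""
    labels = []
    for p in batch:
        best, best_d = 0, float("inf")
        for j, c in enumerate(centroids):
            d = sum((a - b) ** 2 for a, b in zip(p, c))  # squared Euclidean distance
            if d < best_d:
                best, best_d = j, d
        labels.append(best)
    return labels

def lloyd_kmeans(points, k, batch_size=1024, max_iter=50, init=None, seed=0):
    """Plain Lloyd's algorithm with the assignment step processed in batches."""
    if init is None:
        init = random.Random(seed).sample(points, k)
    centroids = [tuple(c) for c in init]
    dim = len(points[0])
    labels = []
    for _ in range(max_iter):
        sums = [[0.0] * dim for _ in range(k)]
        counts = [0] * k
        new_labels = []
        # Assignment step: label one batch at a time and accumulate per-cluster sums.
        for start in range(0, len(points), batch_size):
            batch = points[start:start + batch_size]
            batch_labels = assign_batch(batch, centroids)
            for p, j in zip(batch, batch_labels):
                counts[j] += 1
                for d in range(dim):
                    sums[j][d] += p[d]
            new_labels.extend(batch_labels)
        if new_labels == labels:  # no point changed cluster: converged
            break
        labels = new_labels
        # Update step: move each non-empty cluster's centroid to the mean of its points.
        centroids = [
            tuple(s / counts[j] for s in sums[j]) if counts[j] else centroids[j]
            for j in range(k)
        ]
    return centroids, labels

# Usage: six 2D points forming two well-separated groups.
pts = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centers, labels = lloyd_kmeans(pts, k=2, batch_size=2, init=[pts[0], pts[3]])
```

Every point is compared against every centroid in each iteration, which is the cost that index-based accelerators aim to reduce.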

Code & Models

Repositories

notnnorth/dask-means-cpp
none · Official

Taxonomy

Topics: Data Management and Algorithms · Data Mining Algorithms and Applications · Geographic Information Systems Studies