Solving $k$-means on High-dimensional Big Data

Jan-Philipp W. Kappmeier; Daniel R. Schmidt; Melanie Schmidt

arXiv:1502.04265 · cs.DS · June 1, 2015 · 1 citation

TL;DR

This paper introduces two algorithms, piecy and piecy-mr, for $k$-means clustering of high-dimensional data streams. Both build on the data stream algorithm BICO, and the authors report that they outperform existing methods in speed and accuracy.

Contribution

The paper presents two new algorithms, piecy and piecy-mr, that build on the data stream algorithm BICO to process high-dimensional data for $k$-means clustering in one pass while producing solutions of high quality.

Findings

01. piecy performs well on high-dimensional data with a medium number of points.

02. piecy-mr effectively handles very long data streams in high dimensions.

03. Experimental results demonstrate the superiority of the new algorithms over existing methods.

Abstract

In recent years, there have been major efforts to develop data stream algorithms that process inputs in one pass over the data with little memory requirement. For the $k$-means problem, this has led to the development of several $(1 + \varepsilon)$-approximations (under the assumption that $k$ is a constant), but also to the design of algorithms that are extremely fast in practice and compute solutions of high accuracy. However, when not only the length of the stream is high but also the dimensionality of the input points, current methods reach their limits. We propose two algorithms, piecy and piecy-mr, that are based on the recently developed data stream algorithm BICO that can process high-dimensional data in one pass and output a solution of high quality. While piecy is suited for high-dimensional data with a medium number of points, piecy-mr is meant for high dimensional data…
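To make the terminology in the abstract concrete, the following is a minimal sketch of the standard $k$-means objective (sum of squared distances from each point to its nearest center) and of what a $(1 + \varepsilon)$-approximation guarantee means. It is not the paper's code and does not implement piecy, piecy-mr, or BICO; the function names `squared_distance`, `kmeans_cost`, and `is_approximation` are hypothetical and chosen only for illustration.

```python
from typing import Sequence

Point = Sequence[float]


def squared_distance(p: Point, q: Point) -> float:
    """Squared Euclidean distance between two points of equal dimension."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans_cost(points: Sequence[Point], centers: Sequence[Point]) -> float:
    """k-means objective: each point pays the squared distance to its nearest center."""
    return sum(min(squared_distance(p, c) for c in centers) for p in points)


def is_approximation(cost: float, optimal_cost: float, eps: float) -> bool:
    """True if `cost` is within a factor (1 + eps) of `optimal_cost`."""
    return cost <= (1.0 + eps) * optimal_cost


# Illustrative toy data (an assumption, not taken from the paper).
points = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]
centers = [(0.0, 1.0), (10.0, 1.0)]
cost = kmeans_cost(points, centers)
```

In this toy instance every point lies at squared distance 1 from its nearest center, so the cost is 4. A streaming algorithm cannot evaluate this sum over the full input directly, which is why one-pass methods work on a small summary of the data instead.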

Taxonomy

Topics: Anomaly Detection Techniques and Applications · Data Management and Algorithms · Machine Learning and Data Classification