k-Means for Streaming and Distributed Big Sparse Data

Artem Barger; Dan Feldman

arXiv:1511.08990 · cs.DS · February 9, 2016 · 5 cites

Open Access

TL;DR

This paper introduces a streaming and distributed algorithm for approximating the $k$-means clustering of sparse big data, achieving low memory usage and efficient communication, with proven approximation guarantees and practical improvements.

Contribution

It presents the first streaming algorithm with provable approximation for $k$-means on sparse big data, using a novel sparse coreset of size independent of $d$ and $n$, and demonstrates practical benefits.

Findings

01

The algorithm stores at most $\log n \cdot k^{O(1)}$ input points in memory.

02

The distributed version reduces the running time by a factor of $M$, the number of machines, while communicating a total of $M \cdot k^{O(1)}$ sparse input points.

03

Experimental results show improved clustering performance on real datasets.
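
A logarithmic memory bound like the one in the findings is what the standard merge-and-reduce tree for streaming coresets provides. The sketch below assumes that scheme; this summary does not spell out the paper's streaming mechanism. `MergeReduceStream`, `reduce_stub`, and the parameter `m` (standing in for the $k^{O(1)}$ coreset size) are hypothetical names. `reduce_stub` is a uniform-sampling placeholder, not the paper's deterministic coreset construction. Sparse points are assumed to be stored as `{index: value}` dictionaries.

```python
import random


def reduce_stub(points, m, rng):
    """Placeholder reducer (an assumption, NOT the paper's coreset):
    keep at most m weighted points and rescale so total weight is preserved."""
    if len(points) <= m:
        return list(points)
    total = sum(w for _, w in points)
    kept = rng.sample(points, m)
    scale = total / sum(w for _, w in kept)
    return [(p, w * scale) for p, w in kept]


class MergeReduceStream:
    """Merge-and-reduce tree: at most one summary of <= m points per level."""

    def __init__(self, m, seed=0):
        self.m = m
        self.rng = random.Random(seed)
        self.buffer = []   # raw points not yet summarized (fewer than m)
        self.levels = []   # levels[i] is None or a summary of m * 2**i stream points

    def add(self, sparse_point):
        """Consume one sparse point, given as an {index: value} dict."""
        self.buffer.append((sparse_point, 1.0))
        if len(self.buffer) == self.m:
            self._carry(self.buffer)
            self.buffer = []

    def _carry(self, summary):
        # Works like binary addition: two summaries on the same level are
        # merged, reduced back to m points, and carried one level up.
        i = 0
        while True:
            if i == len(self.levels):
                self.levels.append(None)
            if self.levels[i] is None:
                self.levels[i] = summary
                return
            merged = self.levels[i] + summary
            self.levels[i] = None
            summary = reduce_stub(merged, self.m, self.rng)
            i += 1

    def stored_points(self):
        """Number of weighted points currently held in memory."""
        return len(self.buffer) + sum(len(s) for s in self.levels if s is not None)

    def summary(self):
        """Union of the buffer and all level summaries, as (point, weight) pairs."""
        out = list(self.buffer)
        for s in self.levels:
            if s is not None:
                out.extend(s)
        return out
```

After `n` points the tree has about $\log_2(n/m)$ levels, each holding at most `m` points, so memory grows logarithmically in `n`. In a distributed setting, one plausible arrangement is for each machine to run its own instance and send only its `summary()` to a coordinator.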

Abstract

We provide the first streaming algorithm for computing a provable approximation to the $k$-means of sparse Big Data. Here, sparse Big Data is a set of $n$ vectors in $\mathbb{R}^{d}$, where each vector has $O(1)$ non-zero entries and $d \geq n$. Examples include the adjacency matrix of a graph, web links, social networks, document-term matrices, and image-feature matrices. Our streaming algorithm stores at most $\log n \cdot k^{O(1)}$ input points in memory. If the stream is distributed among $M$ machines, the running time reduces by a factor of $M$, while communicating a total of $M \cdot k^{O(1)}$ (sparse) input points between the machines. Our main technical result is a deterministic algorithm for computing a sparse $(k, \epsilon)$-coreset, which is a weighted subset of $k^{O(1)}$ input points that approximates the sum of squared distances from the $n$ input points to every set of $k$ centers, up to…
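
The quantity a coreset is meant to approximate, the weighted sum of squared distances from the points to their nearest of $k$ centers, can be written down directly. The following is a minimal sketch, assuming sparse vectors are stored as `{index: value}` dictionaries; `sq_dist` and `kmeans_cost` are hypothetical helper names, not taken from the paper.

```python
def sq_dist(p, c):
    """Squared Euclidean distance between two sparse vectors ({index: value})."""
    keys = set(p) | set(c)
    return sum((p.get(j, 0.0) - c.get(j, 0.0)) ** 2 for j in keys)


def kmeans_cost(weighted_points, centers):
    """Sum over (point, weight) pairs of weight * squared distance
    to the nearest center."""
    return sum(
        w * min(sq_dist(p, c) for c in centers)
        for p, w in weighted_points
    )


# Three weighted sparse points and two centers (illustrative values only).
points = [({0: 1.0}, 1.0), ({0: 3.0}, 1.0), ({5: 2.0}, 2.0)]
centers = [{0: 2.0}, {5: 2.0}]
cost = kmeans_cost(points, centers)  # 1*1 + 1*1 + 2*0 = 2.0
```

Evaluating `kmeans_cost` on the full input (all weights 1) and on a weighted subset, for the same centers, is how one would compare the two costs empirically. Centers are kept sparse here only for brevity; in general the mean of sparse points need not be sparse.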

Taxonomy

Topics: Sparse and Compressive Sensing Techniques · Digital Image Processing Techniques · Stochastic Gradient Optimization Techniques