k-Means for Streaming and Distributed Big Sparse Data
Artem Barger, Dan Feldman

TL;DR
This paper introduces a streaming and distributed algorithm for approximating the $k$-means clustering of sparse big data, achieving low memory usage and efficient communication, with proven approximation guarantees and practical improvements.
Contribution
It presents the first streaming algorithm with provable approximation for $k$-means on sparse big data, using a novel sparse coreset of size independent of $d$ and $n$, and demonstrates practical benefits.
Findings
Algorithm stores at most $\log n \cdot k^{O(1)}$ input points in memory.
Distributed version reduces runtime proportionally with the number of machines.
Experimental results show improved clustering performance on real datasets.
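This summary does not say how the logarithmic memory bound is obtained. One standard way to keep only about $\log n$ summaries of a stream in memory is merge-and-reduce, sketched below in Python as an assumption rather than as the authors' procedure. The names `reduce_placeholder` and `stream_summarize` are hypothetical, and the reduce step is a plain weighted sample standing in for a real coreset construction.

```python
import random

def reduce_placeholder(points, weights, m, rng):
    """Stand-in compression step (NOT the paper's coreset): draw m points
    with probability proportional to weight and spread the total weight evenly."""
    total = sum(weights)
    chosen = rng.choices(range(len(points)), weights=weights, k=m)
    return [points[i] for i in chosen], [total / m] * m

def stream_summarize(stream, leaf_size, rng):
    """Merge-and-reduce: keep at most one summary per tree level."""
    levels = {}   # level -> (points, weights)
    buf = []      # raw points not yet forming a full leaf
    peak = 0      # largest number of summaries held at once
    for p in stream:
        buf.append(p)
        if len(buf) < leaf_size:
            continue
        summary = (buf, [1.0] * leaf_size)
        buf = []
        lvl = 0
        # Carry upward like a binary counter: merge equal-level summaries.
        while lvl in levels:
            other_p, other_w = levels.pop(lvl)
            summary = reduce_placeholder(other_p + summary[0],
                                         other_w + summary[1],
                                         leaf_size, rng)
            lvl += 1
        levels[lvl] = summary
        peak = max(peak, len(levels))
    return levels, buf, peak

rng = random.Random(0)
stream = [(float(i), float(i % 7)) for i in range(1024)]
levels, leftover, peak = stream_summarize(stream, 16, rng)
print(sorted(levels), peak, sum(levels[6][1]))
```

With 1024 points and leaves of 16 points, the stream forms 64 leaves, so at most 6 summaries of 16 points each are held at any time, and the total weight of the final summary still equals the number of points read.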
Abstract
We provide the first streaming algorithm for computing a provable approximation to the $k$-means of sparse Big data. Here, sparse Big Data is a set of $n$ vectors in $\mathbb{R}^d$, where each vector has $O(1)$ non-zero entries, and $d \geq n$. E.g., adjacency matrix of a graph, web-links, social network, document-terms, or image-features matrices. Our streaming algorithm stores at most $\log n \cdot k^{O(1)}$ input points in memory. If the stream is distributed among $M$ machines, the running time reduces by a factor of $M$, while communicating a total of $M \cdot k^{O(1)}$ (sparse) input points between the machines. Our main technical result is a deterministic algorithm for computing a sparse $(k,\epsilon)$-coreset, which is a weighted subset of $k^{O(1)}$ input points that approximates the sum of squared distances from the $n$ input points to every $k$ centers, up to…
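To make the coreset property concrete, the following Python sketch compares the sum of squared distances to a set of centers computed on the full input with the weighted sum computed on a small weighted subset. It is only an illustration under stated assumptions: `kmeans_cost` and `uniform_weighted_subset` are hypothetical names, the data is synthetic, and the subset is a uniform sample, not the deterministic construction described in the abstract.

```python
import random

def sq_dist_to_nearest(p, centers):
    """Squared Euclidean distance from p to its closest center."""
    return min(sum((a - b) ** 2 for a, b in zip(p, c)) for c in centers)

def kmeans_cost(points, centers, weights=None):
    """(Weighted) sum of squared distances from points to their nearest center."""
    if weights is None:
        weights = [1.0] * len(points)
    return sum(w * sq_dist_to_nearest(p, centers)
               for p, w in zip(points, weights))

def uniform_weighted_subset(points, m, rng):
    """Placeholder for a coreset: m uniformly sampled input points, each
    weighted n/m. This is NOT the paper's deterministic construction."""
    idx = rng.sample(range(len(points)), m)
    w = len(points) / m
    return [points[i] for i in idx], [w] * m

rng = random.Random(0)
# Synthetic data: two Gaussian blobs of 500 points each.
points = [(rng.gauss(cx, 1.0), rng.gauss(cy, 1.0))
          for cx, cy in [(0, 0), (10, 10)] for _ in range(500)]
subset, weights = uniform_weighted_subset(points, 100, rng)

centers = [(0.0, 0.0), (10.0, 10.0)]
full = kmeans_cost(points, centers)
approx = kmeans_cost(subset, centers, weights)
rel_err = abs(full - approx) / full
print(f"full={full:.1f} approx={approx:.1f} relative error={rel_err:.3f}")
```

A $(k,\epsilon)$-coreset would have to keep this relative error small for every choice of $k$ centers, not just the one tested here. Because the subset consists of input points, each selected point is exactly as sparse as it was in the input.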
Taxonomy
Topics: Sparse and Compressive Sensing Techniques · Digital Image Processing Techniques · Stochastic Gradient Optimization Techniques
