Fast Clustering using MapReduce

Alina Ene; Sungjin Im; Benjamin Moseley

arXiv:1109.1579·cs.DC·September 9, 2011·1 cites

TL;DR

This paper introduces fast, practical MapReduce algorithms for large-scale clustering problems such as $k$-center and $k$-median, with constant-factor approximation guarantees and competitive empirical performance.

Contribution

It gives the first analysis showing that several clustering algorithms are in $MRC^{0}$, the theoretical MapReduce class introduced by Karloff et al.; develops sampling-based algorithms with constant-factor approximation guarantees; and demonstrates their efficiency and effectiveness through experiments.

Findings

01

Algorithms run in a constant number of MapReduce rounds.

02

Solutions are comparable to or better than those of existing algorithms.

03

On large datasets, the algorithms are faster than the parallel methods they were tested against.

Abstract

Clustering problems have numerous applications and are becoming more challenging as the size of the data increases. In this paper, we consider designing clustering algorithms that can be used in MapReduce, the most popular programming environment for processing large datasets. We focus on the practical and popular clustering problems, $k$-center and $k$-median. We develop fast clustering algorithms with constant factor approximation guarantees. From a theoretical perspective, we give the first analysis that shows several clustering algorithms are in $MRC^{0}$, a theoretical MapReduce class introduced by Karloff et al. \cite{KarloffSV10}. Our algorithms use sampling to decrease the data size and they run a time-consuming clustering algorithm such as local search or Lloyd's algorithm on the resulting data set. Our algorithms have sufficient flexibility to be used in practice since…
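
The sample-then-cluster idea in the abstract can be illustrated with a minimal single-machine sketch. It is not the paper's algorithm: uniform random sampling stands in for the paper's sampling procedure (which this excerpt does not describe), nothing is distributed across MapReduce rounds, and no approximation guarantee is claimed. The function names `kmedian_cost`, `lloyd`, and `sample_then_cluster` and the parameter `sample_size` are hypothetical and chosen for illustration only.

```python
import math
import random


def kmedian_cost(points, centers):
    """Sum over all points of the distance to the nearest center."""
    return sum(min(math.dist(p, c) for c in centers) for p in points)


def lloyd(points, k, iters=20, seed=0):
    """Plain Lloyd's iterations on a list of equal-length tuples."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # requires k <= len(points)
    for _ in range(iters):
        # Assignment step: attach each point to its nearest center.
        clusters = [[] for _ in range(k)]
        for p in points:
            j = min(range(k), key=lambda i: math.dist(p, centers[i]))
            clusters[j].append(p)
        # Update step: move each center to the mean of its cluster.
        new_centers = []
        for j, members in enumerate(clusters):
            if members:
                dim = len(members[0])
                mean = tuple(
                    sum(m[d] for m in members) / len(members)
                    for d in range(dim)
                )
                new_centers.append(mean)
            else:
                new_centers.append(centers[j])  # keep an empty cluster's center
        if new_centers == centers:
            break
        centers = new_centers
    return centers


def sample_then_cluster(points, k, sample_size, seed=0):
    """Assumption: a uniform sample replaces the paper's sampling scheme."""
    rng = random.Random(seed)
    sample = rng.sample(points, min(sample_size, len(points)))
    # The expensive clustering step only ever sees the small sample.
    return lloyd(sample, k, seed=seed)


if __name__ == "__main__":
    rng = random.Random(1)
    blobs = [(0.0, 0.0), (10.0, 10.0), (-10.0, 10.0)]
    data = [
        (cx + rng.gauss(0, 1), cy + rng.gauss(0, 1))
        for cx, cy in blobs
        for _ in range(2000)
    ]
    centers = sample_then_cluster(data, k=3, sample_size=300)
    print(centers, kmedian_cost(data, centers))
```

The centers are computed from the sample but evaluated on the full data set, which is how the cost of the reduced-size run would be compared against clustering all points directly. Note that Lloyd's update step targets the $k$-means objective, so the $k$-median cost here serves only as an evaluation metric.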

Taxonomy

Topics: Advanced Clustering Algorithms Research · Data Management and Algorithms · Graph Theory and Algorithms