The Efficiency of MapReduce in Parallel External Memory

Gero Greiner; Riko Jacob

arXiv:1112.3765 · cs.DC · December 19, 2011

Open Access

TL;DR

This paper provides theoretical bounds on the I/O complexity of the MapReduce framework within the parallel external memory model, focusing on the shuffle step, and compares its efficiency to other models.

Contribution

It establishes upper and lower bounds for the shuffle step's I/O complexity in MapReduce, linking practical performance to theoretical models.

Findings

1. Upper and lower bounds on the I/O complexity of the shuffle step that match up to constant factors

2. MapReduce's I/O efficiency can be bounded and compared to PEM and BSP models

3. Results show the potential performance loss in MapReduce relative to optimal algorithms

Abstract

Since its introduction in 2004, the MapReduce framework has become one of the standard approaches in massive distributed and parallel computation. In contrast to its intensive use in practice, its theoretical footing is still limited, and little work has been done so far to put MapReduce on a par with the major computational models. Following pioneering work that relates the MapReduce framework to PRAM and BSP in their macroscopic structure, we focus on the functionality provided by the framework itself, considered in the parallel external memory (PEM) model. In this setting, we present upper and lower bounds on the parallel I/O complexity of the shuffle step that match up to constant factors. The shuffle step is the single communication phase in which all information of one MapReduce invocation is transferred from map workers to reduce workers. Hence, we move the focus towards the internal…
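To make the shuffle step concrete, the following is a minimal in-memory sketch of what it does functionally: every (key, value) pair emitted by a map worker is routed to exactly one reduce worker, and values with the same key end up together. It is an illustration only, not the paper's PEM algorithm and not any framework's actual implementation. The names `shuffle` and `partition_of`, and the CRC32-based partitioner, are assumptions introduced here.

```python
import zlib
from collections import defaultdict


def partition_of(key, num_reducers):
    """Hypothetical partitioner: deterministically map a key to a reduce worker."""
    return zlib.crc32(str(key).encode("utf-8")) % num_reducers


def shuffle(map_outputs, num_reducers):
    """Route all (key, value) pairs from map workers to reduce workers.

    map_outputs: one list of (key, value) pairs per map worker.
    Returns one dict per reduce worker, mapping each key to its list of values.
    """
    partitions = [defaultdict(list) for _ in range(num_reducers)]
    for worker_output in map_outputs:
        for key, value in worker_output:
            # Each pair crosses from its map worker to exactly one reduce worker.
            partitions[partition_of(key, num_reducers)][key].append(value)
    return [dict(p) for p in partitions]


if __name__ == "__main__":
    outputs = [[("a", 1), ("b", 2)], [("a", 3)]]
    for reducer_id, part in enumerate(shuffle(outputs, 2)):
        print(reducer_id, part)
```

In this sketch every pair is moved once in main memory. The paper's question is how many parallel block transfers the same data movement costs when the pairs reside in external memory, which this toy version does not model.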

Taxonomy

Topics: Stochastic Gradient Optimization Techniques · Cloud Computing and Resource Management · Advanced Data Storage Technologies