On data skewness, stragglers, and MapReduce progress indicators

Emilio Coppa; Irene Finocchi

arXiv:1503.09062 · cs.DC · April 3, 2015 · 1 citation

Open Access

TL;DR

This paper introduces NearestFit, a novel progress indicator for MapReduce that accurately predicts job completion times despite data skewness and load imbalance, outperforming existing methods.

Contribution

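The paper presents NearestFit, a profile-guided progress indicator for MapReduce. It does not assume that running time is linear in the input size; instead it combines nearest neighbor regression with statistical curve fitting, using efficient algorithms to predict performance more accurately.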

Findings

01. NearestFit achieves high accuracy in diverse scenarios.

02. It operates with low space and time overheads.

03. It is more accurate than existing progress indicators, including Hadoop's.

Abstract

We tackle the problem of predicting the performance of MapReduce applications, designing accurate progress indicators that keep programmers informed on the percentage of completed computation time during the execution of a job. Through extensive experiments, we show that state-of-the-art progress indicators (including the one provided by Hadoop) can be seriously harmed by data skewness, load imbalance, and straggling tasks. This is mainly due to their implicit assumption that the running time depends linearly on the input size. We thus design a novel profile-guided progress indicator, called NearestFit, that does not rely on the linear hypothesis and exploits a careful combination of nearest neighbor regression and statistical curve fitting techniques. Our theoretical progress model requires fine-grained profile data, which can be very difficult to manage in practice. To…
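The idea of combining nearest neighbor regression with curve fitting can be illustrated with a minimal Python sketch. It is not the authors' NearestFit implementation: the profile format (pairs of input size and running time), the power-law curve, the parameters `k` and `tolerance`, the sequential-execution simplification, and all function names are assumptions made for illustration only.

```python
import math

# Hypothetical profile format: list of (input_size, running_time) pairs
# collected from tasks that have already completed.

def fit_power_law(profile):
    """Least-squares fit of t = a * s**b in log-log space.

    Assumes at least two distinct, positive input sizes in the profile.
    """
    xs = [math.log(s) for s, _ in profile]
    ys = [math.log(t) for _, t in profile]
    n = len(profile)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    b = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sxx
    a = math.exp(mean_y - b * mean_x)
    return a, b

def predict_time(profile, size, k=2, tolerance=0.25):
    """Predict the running time of a task with the given input size.

    Nearest neighbor regression: if at least k profiled tasks have an
    input size within `tolerance * size` of the target, average the k
    closest. Otherwise fall back to the fitted curve (assumed power law).
    """
    near = sorted(
        (abs(s - size), t) for s, t in profile
        if abs(s - size) <= tolerance * size
    )
    if len(near) >= k:
        return sum(t for _, t in near[:k]) / k
    a, b = fit_power_law(profile)
    return a * size ** b

def progress(profile, pending_sizes):
    """Fraction of total time already spent (sequential simplification)."""
    done = sum(t for _, t in profile)
    remaining = sum(predict_time(profile, s) for s in pending_sizes)
    return done / (done + remaining)

def linear_progress(profile, pending_sizes):
    """Baseline that assumes time is proportional to input size."""
    processed = sum(s for s, _ in profile)
    return processed / (processed + sum(pending_sizes))

if __name__ == "__main__":
    # Skewed workload: running time grows quadratically with input size.
    profile = [(1, 1.0), (2, 4.0), (4, 16.0), (8, 64.0)]
    pending = [16]
    print(f"linear estimate:  {linear_progress(profile, pending):.2f}")
    print(f"profile estimate: {progress(profile, pending):.2f}")
```

In this toy workload one large pending task dominates the remaining time. The linear baseline reports roughly 48% progress, while the profile-based estimate reports roughly 25%, which matches the quadratic cost model used to generate the data.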

Taxonomy

Topics: Cloud Computing and Resource Management · Advanced Data Storage Technologies · Blockchain Technology Applications and Security