Statistical Regression to Predict Total Cumulative CPU Usage of MapReduce Jobs
Nikzad Babaii Rizvandi, Javid Taheri, Reza Moraveji, Albert Y. Zomaya

TL;DR
This paper introduces a polynomial regression model to predict total CPU usage of MapReduce jobs based on configuration parameters, aiding resource provisioning and scaling in cloud environments.
Contribution
It presents a novel approach using regression analysis to accurately estimate CPU usage from configuration settings and input data size in MapReduce jobs.
Findings
Prediction accuracy within 8% of actual CPU usage
Model validated on three real-world applications
Influence of input data scaling on CPU usage analyzed
Abstract
Recently, businesses have started using MapReduce as a popular computation framework for processing large amounts of data, for tasks such as spam detection and various data mining workloads, in both public and private clouds. Two of the challenging questions in such environments are (1) choosing suitable values for MapReduce configuration parameters, e.g., number of mappers, number of reducers, and DFS block size, and (2) predicting the amount of resources that a user should lease from the service provider. Currently, both choosing configuration parameters and estimating required resources are solely the user's responsibility. In this paper, we present an approach to provision the total CPU usage, in clock cycles, of jobs in a MapReduce environment. For a MapReduce job, a profile of total CPU usage in clock cycles is built from the job's past executions with different values of two…
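The profile-then-regress idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. The abstract is cut off before it names the two profiled parameters, so the choice of number of mappers and number of reducers below is an assumption made for illustration only. The polynomial degree, the function names (`poly_features`, `fit_profile`, `predict_cpu`), and all data values are likewise hypothetical; the "CPU usage" numbers are synthetic and in arbitrary units.

```python
import numpy as np

def poly_features(mappers, reducers, degree=2):
    """Build all polynomial terms mappers^i * reducers^j with i + j <= degree."""
    m = np.atleast_1d(np.asarray(mappers, dtype=float))
    r = np.atleast_1d(np.asarray(reducers, dtype=float))
    cols = [m**i * r**j
            for i in range(degree + 1)
            for j in range(degree + 1 - i)]
    return np.column_stack(cols)

def fit_profile(mappers, reducers, cpu_usage, degree=2):
    """Least-squares fit of polynomial coefficients to a job's past executions."""
    X = poly_features(mappers, reducers, degree)
    coef, *_ = np.linalg.lstsq(X, np.asarray(cpu_usage, dtype=float), rcond=None)
    return coef

def predict_cpu(coef, mappers, reducers, degree=2):
    """Predict total CPU usage for new parameter values."""
    return poly_features(mappers, reducers, degree) @ coef

# Synthetic "profile" of 30 past runs (made-up data, arbitrary units).
rng = np.random.default_rng(0)
mappers = rng.integers(4, 40, size=30)
reducers = rng.integers(1, 20, size=30)
cpu_usage = 5.0 + 0.8 * mappers + 1.5 * reducers + 0.02 * mappers * reducers

coef = fit_profile(mappers, reducers, cpu_usage)
pred = predict_cpu(coef, [20], [10])
print(f"predicted CPU usage for 20 mappers, 10 reducers: {pred[0]:.2f}")
```

Because the synthetic data here is generated from a noise-free polynomial, the fit recovers it almost exactly; measurements from real job executions would be noisy, and the prediction error would have to be assessed on held-out runs.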
Taxonomy
Topics: Cloud Computing and Resource Management · Big Data and Business Intelligence · Data Stream Mining Techniques
