TL;DR
This paper compares Apache Spark and Hadoop MapReduce for classification tasks in Big Data, evaluating performance, accuracy, and scalability to guide tool selection.
Contribution
It provides a comprehensive benchmark across multiple metrics: execution time, accuracy, and scalability. Its novelty lies in evaluating performance on a specific task, classification, rather than in general.
Findings
Spark is 5 times faster than MapReduce in training.
Spark's performance degrades with larger input workloads.
MapReduce is slightly more accurate than Spark (by about 3%).
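As a rough illustration of how the metrics behind these findings can be computed, here is a minimal sketch in plain Python. The helper names (`timed`, `speedup`, `accuracy`) are hypothetical and are not taken from the paper, and the sketch does not reproduce the paper's measurement setup; it only shows the arithmetic for a wall-clock timing, a speedup ratio, and a classification accuracy.

```python
import time
from typing import Callable, Sequence, Tuple


def timed(fn: Callable[[], object]) -> Tuple[object, float]:
    """Run fn once and return (result, elapsed wall-clock seconds)."""
    start = time.perf_counter()
    result = fn()
    return result, time.perf_counter() - start


def speedup(baseline_seconds: float, candidate_seconds: float) -> float:
    """How many times faster the candidate is than the baseline."""
    if candidate_seconds <= 0:
        raise ValueError("candidate_seconds must be positive")
    return baseline_seconds / candidate_seconds


def accuracy(predicted: Sequence[int], actual: Sequence[int]) -> float:
    """Fraction of predicted labels that match the true labels."""
    if len(predicted) != len(actual) or len(actual) == 0:
        raise ValueError("label sequences must be non-empty and of equal length")
    correct = sum(p == a for p, a in zip(predicted, actual))
    return correct / len(actual)
```

With made-up inputs (not measurements from the paper), a baseline run of 50 seconds against a candidate run of 10 seconds gives `speedup(50.0, 10.0) == 5.0`. Scalability could be probed by repeating the timing at several input sizes and comparing how the elapsed time grows.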
Abstract
Most of the popular Big Data analytics tools evolved to adapt their working environment to extract valuable information from a vast amount of unstructured data. The ability of data mining techniques to filter this helpful information from Big Data led to the term Big Data Mining. Shifting the scope of data from small-size, structured, and stable data to huge volume, unstructured, and quickly changing data brings many data management challenges. Different tools cope with these challenges in their own way due to their architectural limitations. There are numerous parameters to take into consideration when choosing the right data management framework based on the task at hand. In this paper, we present a comprehensive benchmark for two widely used Big Data analytics tools, namely Apache Spark and Hadoop MapReduce, on a common data mining task, i.e., classification. We employ several…
