Performance Benefits of DataMPI: A Case Study with BigDataBench
Fan Liang, Chen Feng, Xiaoyi Lu, Zhiwei Xu

TL;DR
This paper demonstrates that DataMPI, a communication library extending MPI for Big Data, significantly improves performance and resource efficiency over Hadoop and Spark in Big Data processing tasks.
Contribution
The paper presents a comprehensive performance analysis of DataMPI using BigDataBench, showing its superior speed and resource utilization compared to Hadoop and Spark.
Findings
DataMPI achieves up to 55% speedup over Hadoop.
DataMPI achieves up to 39% speedup over Spark.
DataMPI uses more efficient communication and resource management.
Abstract
Apache Hadoop and Spark are gaining prominence in Big Data processing and analytics, and both are widely deployed at Internet companies. At the same time, high-performance data analysis requirements are leading academic and industrial communities to adopt state-of-the-art HPC technologies to solve Big Data problems. Recently, we proposed DataMPI, a key-value pair based communication library that extends MPI to support Hadoop/Spark-like Big Data computing jobs. In this paper, we use BigDataBench, a Big Data benchmark suite, to comprehensively study the performance and resource utilization characteristics of Hadoop, Spark and DataMPI. In our experiments, DataMPI achieves speedups in job execution time of up to 55% and 39% over Hadoop and Spark, respectively. Most of the benefits come from the high-efficiency communication…
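A percentage "speedup" in job execution time can be read in more than one way, and the text here does not say which definition the paper uses. The sketch below shows two common readings side by side. The function names and the timings are hypothetical and chosen only for illustration; they are not measurements from the paper.

```python
def time_reduction_pct(baseline_s: float, candidate_s: float) -> float:
    """Percent of the baseline execution time that the candidate saves."""
    if baseline_s <= 0 or candidate_s <= 0:
        raise ValueError("execution times must be positive")
    return 100.0 * (baseline_s - candidate_s) / baseline_s


def rate_gain_pct(baseline_s: float, candidate_s: float) -> float:
    """Percent increase in jobs completed per unit time versus the baseline."""
    if baseline_s <= 0 or candidate_s <= 0:
        raise ValueError("execution times must be positive")
    return 100.0 * (baseline_s / candidate_s - 1.0)


# Hypothetical timings (seconds), NOT taken from the paper.
baseline_s = 100.0   # e.g. a job on the baseline framework
candidate_s = 45.0   # the same job on the faster framework

reduction = time_reduction_pct(baseline_s, candidate_s)
gain = rate_gain_pct(baseline_s, candidate_s)
print(f"time reduction: {reduction:.1f}%")
print(f"rate gain:      {gain:.1f}%")
```

The two numbers differ substantially for the same pair of timings, so it is worth checking the definition in the full paper before comparing the 55% and 39% figures with results reported elsewhere.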
Taxonomy
Topics: Cloud Computing and Resource Management · Distributed and Parallel Computing Systems · Graph Theory and Algorithms
