Performance Evaluation of Distributed Computing Environments with Hadoop and Spark Frameworks
Vladyslav Taran, Oleg Alienin, Sergii Stirenko, A. Rojbi, and Yuri Gordienko

TL;DR
This paper evaluates the performance of Hadoop and Spark distributed computing frameworks on real and virtual clusters using word counting tasks, revealing that processing times grow rapidly with data size and highlighting implications for Big Data applications.
Contribution
It provides a comparative analysis of Hadoop and Spark performance on real and virtual clusters, emphasizing the impact of data size on processing efficiency and scalability.
Findings
Processing times grow faster than a power function with data size.
Speedup decreases significantly as dataset size increases.
Virtual clusters show more pronounced performance degradation.
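As a rough illustration of how the first two findings could be checked on a set of timing measurements, the sketch below relies on the fact that a power function T = a·n^b is a straight line on a log-log plot, so a local slope that keeps increasing with dataset size suggests faster-than-power-law growth. The function names, the tolerance, and the definition of speedup as baseline time divided by cluster time are assumptions made for this example; the paper's own definitions and measurements are not reproduced here.

```python
import math

def local_exponents(sizes, times):
    """Slope of log(time) versus log(size) between consecutive measurements.

    For a pure power law T = a * n**b every slope equals b.
    """
    points = list(zip(sizes, times))
    return [
        (math.log(t2) - math.log(t1)) / (math.log(n2) - math.log(n1))
        for (n1, t1), (n2, t2) in zip(points, points[1:])
    ]

def grows_faster_than_power_law(sizes, times, tol=1e-9):
    """True if every local log-log slope exceeds the previous one by more than tol.

    This is a heuristic check (an assumption of this sketch), not a statistical test.
    """
    slopes = local_exponents(sizes, times)
    return all(later > earlier + tol for earlier, later in zip(slopes, slopes[1:]))

def speedup(baseline_time, cluster_time):
    """Assumed definition: time on the baseline setup divided by time on the cluster."""
    return baseline_time / cluster_time
```

With hypothetical inputs, sizes `[1, 2, 4, 8]` and times `[1, 4, 16, 64]` follow n² exactly, so the check returns `False`; times `[2, 4, 16, 256]` follow 2^n and the check returns `True`.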
Abstract
Recently, due to the rapid development of information and communication technologies, data are being created and consumed at an avalanche-like rate. Distributed computing creates the preconditions for analyzing and processing such Big Data by distributing the computations among a number of compute nodes. In this work, the performance of distributed computing environments based on the Hadoop and Spark frameworks is estimated for real and virtual versions of clusters. As a test task, we chose the classic use case of word counting in texts of various sizes. It was found that running times grow very fast with dataset size, faster even than a power function. For both the real and virtual cluster implementations, this tendency is similar for the Hadoop and Spark frameworks. Moreover, speedup values decrease significantly with the growth of dataset size, especially for the virtual version…
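The word counting test task mentioned in the abstract can be sketched in the map, shuffle, and reduce style that both frameworks build on. This is a single-process, in-memory illustration in plain Python, not Hadoop or Spark code and not the authors' implementation; the function names and the tokenization rule (lowercased runs of letters, digits, and apostrophes) are assumptions chosen for the example.

```python
import re
from collections import defaultdict

def map_phase(line):
    """Emit a (word, 1) pair for each word in one line of text."""
    # Tokenization rule is an assumption of this sketch.
    return [(word, 1) for word in re.findall(r"[a-z0-9']+", line.lower())]

def shuffle_phase(pairs):
    """Group the emitted counts by word, as the framework would between map and reduce."""
    grouped = defaultdict(list)
    for word, count in pairs:
        grouped[word].append(count)
    return grouped

def reduce_phase(grouped):
    """Sum the grouped counts for each word."""
    return {word: sum(counts) for word, counts in grouped.items()}

def word_count(lines):
    """Run the three phases sequentially over an iterable of text lines."""
    pairs = [pair for line in lines for pair in map_phase(line)]
    return reduce_phase(shuffle_phase(pairs))
```

For example, `word_count(["big data big", "Data"])` returns `{"big": 2, "data": 2}`. In a real cluster the map and reduce phases run in parallel on different nodes and the shuffle moves data over the network, which is where the dataset size effects described above come into play.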
