Comparative analysis of large data processing in Apache Spark using Java, Python and Scala
Ivan Borodii, Illia Fedorovych, Halyna Osukhivska, Diana Velychko, Roman Butsii

TL;DR
This study compares the performance of Java, Python, and Scala in processing large datasets with Apache Spark, revealing how programming language choice impacts efficiency across different data sizes and operations.
Contribution
It provides a comprehensive comparison of full ETL workflows in Spark across three programming languages using Apache Iceberg, an area that prior research has covered only to a limited extent.
Findings
Python outperforms Java and Scala on small datasets.
Scala and Java are more efficient for large and complex data processing.
Performance varies significantly with data size and operation complexity.
Abstract
This study presents a comparative analysis of large-dataset processing on the Apache Spark platform using the Java, Python, and Scala programming languages. Although prior works have focused on individual stages, comprehensive comparisons of full ETL workflows across programming languages using Apache Iceberg remain limited. The analysis executed several operations: reading data from CSV files, transforming it, and loading it into an Apache Iceberg analytical table. Spark performance was found to vary significantly with the amount of data and the programming language used. When processing a 5-megabyte CSV file, Python achieved the best result at 6.71 seconds, ahead of Scala at 9.13 seconds and Java at 9.62 seconds. For processing a large CSV file of 1.6…
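The workflow the abstract describes (read CSV, transform, load into an Iceberg table) could be timed along the following lines in PySpark. This is a minimal sketch, not the authors' benchmark code: the catalog name `local`, the warehouse path, the table name, and the `dropna`/`dropDuplicates` transform are all assumptions chosen for illustration, and the Iceberg Spark runtime JAR is assumed to be on the classpath.

```python
import time
from contextlib import contextmanager


@contextmanager
def timed(label, results):
    """Store the wall-clock seconds spent in the enclosed block under `label`."""
    start = time.perf_counter()
    try:
        yield
    finally:
        results[label] = time.perf_counter() - start


def run_etl(csv_path, table_name="local.db.events"):
    """Read a CSV file, apply a simple transform, and write an Iceberg table.

    Hypothetical setup: a Hadoop-type Iceberg catalog named `local` with a
    warehouse under /tmp. Requires pyspark and the Iceberg Spark runtime.
    """
    # Imported lazily so the timing helper can be used without Spark installed.
    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder.appName("csv-to-iceberg-sketch")
        .config(
            "spark.sql.extensions",
            "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions",
        )
        .config("spark.sql.catalog.local", "org.apache.iceberg.spark.SparkCatalog")
        .config("spark.sql.catalog.local.type", "hadoop")
        .config("spark.sql.catalog.local.warehouse", "/tmp/iceberg-warehouse")
        .getOrCreate()
    )

    timings = {}
    with timed("extract", timings):
        # inferSchema=True makes Spark scan the file once to detect column types.
        df = spark.read.csv(csv_path, header=True, inferSchema=True)
    with timed("transform", timings):
        # Illustrative transform only; this step just builds the query plan.
        cleaned = df.dropna().dropDuplicates()
    with timed("load", timings):
        # The write triggers execution of the whole plan.
        cleaned.writeTo(table_name).createOrReplace()

    spark.stop()
    return timings
```

Because Spark evaluates lazily, most of the measured time in a sketch like this lands in the load step, so the end-to-end total is the more meaningful number when comparing languages.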
Taxonomy
Topics: Computational Physics and Python Applications · Data Analysis with R · Scientific Computing and Data Management
