Does Big Data Require Complex Systems? A Performance Comparison Between Spark and Unicage Shell Scripts
Duarte M. Nascimento, Miguel Ferreira, Miguel L. Pardal

TL;DR
This study compares the performance of Spark and Unicage for big data processing, finding Unicage faster for search tasks and Spark better for complex dependencies, highlighting trade-offs between complexity and performance.
Contribution
It provides a performance comparison between Spark and Unicage, demonstrating their respective strengths and limitations in handling large-scale data processing tasks.
Findings
Unicage outperforms Spark in search workloads like grep and select.
Spark handles workloads with inter-record dependencies, such as sort and join, more effectively.
Performance varies significantly based on workload type and system configuration.
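The split between search workloads and workloads with inter-record dependencies can be illustrated with a minimal single-machine sketch. This is not code from the paper and not how Spark or Unicage implement these operations. The record shape, the function names, and the sample data are all hypothetical, chosen only to show why grep and select can process each record in isolation while sort and join cannot emit results until they have seen other records.

```python
from typing import Dict, Iterable, Iterator, List, Set, Tuple

# Hypothetical record shape for illustration: (key, text).
Record = Tuple[str, str]


def grep(records: Iterable[Record], needle: str) -> Iterator[Record]:
    """Search workload: each record is tested on its own and can stream through."""
    for key, text in records:
        if needle in text:
            yield (key, text)


def select(records: Iterable[Record], keys: Set[str]) -> Iterator[Record]:
    """Search workload: keep records whose key is in a fixed set, one at a time."""
    for key, text in records:
        if key in keys:
            yield (key, text)


def sort_by_key(records: Iterable[Record]) -> List[Record]:
    """Inter-record dependency: the first output depends on every input record."""
    return sorted(records, key=lambda record: record[0])


def join(left: Iterable[Record], right: Iterable[Record]) -> List[Tuple[str, str, str]]:
    """Inter-record dependency: a simple hash join that must index one whole input."""
    index: Dict[str, List[str]] = {}
    for key, text in right:
        index.setdefault(key, []).append(text)
    return [
        (key, left_text, right_text)
        for key, left_text in left
        for right_text in index.get(key, [])
    ]


# Made-up sample data, not taken from the paper's benchmarks.
logs = [("b", "error: disk full"), ("a", "ok"), ("c", "error: timeout")]
owners = [("a", "alice"), ("c", "carol")]

print(list(grep(logs, "error")))
print(list(select(logs, {"a"})))
print(sort_by_key(logs))
print(join(logs, owners))
```

In a cluster, the first two functions can run on each data partition independently, whereas the last two require records with related keys to be brought together, which is where coordination between nodes becomes necessary.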
Abstract
The paradigm of big data is characterized by the need to collect and process data sets of great volume, arriving at the systems with great velocity, in a variety of formats. Spark is a widely used big data processing system that can be integrated with Hadoop to provide powerful abstractions to developers, such as distributed storage through HDFS and resource management through YARN. When all the required configurations are made, Spark can also provide quality attributes, such as scalability, fault tolerance, and security. However, all of these benefits come at the cost of complexity, with high memory requirements, and additional latency in processing. An alternative approach is to use a lean software stack, like Unicage, that delegates most control back to the developer. In this work we evaluated the performance of big data processing with Spark versus Unicage, in a cluster environment…
Taxonomy
Topics: Cloud Computing and Resource Management · Advanced Data Storage Technologies · Distributed and Parallel Computing Systems
