Characterizing BigBench queries, Hive, and Spark in multi-cloud environments
Nicolas Poggi, Alejandro Montero, David Carrera

TL;DR
This paper benchmarks and characterizes BigBench queries and the performance of Hive and Spark in multi-cloud PaaS environments, providing insights into resource requirements, scalability, and configuration needs.
Contribution
It offers a comprehensive analysis of BigBench queries and compares Hive and Spark performance across major cloud providers, highlighting configuration impacts and resource consumption.
Findings
Hive and Spark have distinct resource usage patterns.
Performance varies significantly with data scale and configuration.
Proper tuning is essential for optimal performance in cloud environments.
Abstract
BigBench is the new standard (TPCx-BB) for benchmarking and testing Big Data systems. The TPCx-BB specification describes several business use cases -- queries -- which require a broad combination of data extraction techniques including SQL, Map/Reduce (M/R), user code (UDF), and Machine Learning to fulfill them. However, there is currently no widespread knowledge of the different resource requirements and expected performance of each query, as there is for more established benchmarks. At the same time, cloud providers currently offer convenient on-demand managed big data clusters (PaaS) with a pay-as-you-go model. In PaaS, analytical engines such as Hive and Spark come ready to use, with a general-purpose configuration and upgrade management. The study characterizes both the BigBench queries and the out-of-the-box performance of Spark and Hive versions in the cloud. At the same…
