Benchmarking Resource Usage of Underlying Datatypes of Apache Spark

Brittany Nicholls; Mariama Adangwa; Rachel Estes; Hugues Nelson Iradukunda; Qingquan Zhang; Ting Zhu

arXiv:2012.04192·eess.SY·December 9, 2020

TL;DR

This paper evaluates how different underlying datatypes in Apache Spark affect resource usage, proposing resource metrics like peak execution memory as more reliable benchmarks than execution time.

Contribution

It introduces resource-based benchmarking of Spark datatypes, highlighting the limitations of execution time as a reproducible metric.

Findings

01. Resource usage varies significantly across datatypes.

02. Peak execution memory is a reliable benchmarking metric.

03. Different applications show distinct resource consumption patterns.

Abstract

This paper examines how the resource usage of a Spark analytic is affected by its underlying datatype: Resilient Distributed Datasets (RDDs), Datasets, or DataFrames. Resource usage is explored as a viable and preferred alternative to execution time, the metric most commonly used to benchmark big data analytics. The run time of an analytic is shown not to be a reliably reproducible metric, since many factors external to the job can affect it. Instead, metrics readily available through Spark, including peak execution memory, are used to benchmark the resource usage of these datatypes in common Spark applications such as counting, caching, repartitioning, and KMeans.
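As a rough illustration of what "metrics readily available through Spark" can look like in practice, the sketch below reads task-end events in the JSON-lines style of a Spark event log and reports the largest per-task peak execution memory for each stage. It is not the authors' tooling. The key names (`SparkListenerTaskEnd`, `Task Metrics`, `Peak Execution Memory`) are assumed to match the event log format of the Spark version in use, the helper name `peak_execution_memory_by_stage` is hypothetical, and the sample values are invented for the example rather than taken from the paper.

```python
import json
from collections import defaultdict

# Hypothetical event-log lines, for illustration only.
# The numbers are made up and are NOT results from the paper.
SAMPLE_LOG = """\
{"Event": "SparkListenerTaskEnd", "Stage ID": 0, "Task Metrics": {"Executor Run Time": 120, "Peak Execution Memory": 1048576}}
{"Event": "SparkListenerTaskEnd", "Stage ID": 0, "Task Metrics": {"Executor Run Time": 95, "Peak Execution Memory": 2097152}}
{"Event": "SparkListenerJobStart", "Job ID": 0}
{"Event": "SparkListenerTaskEnd", "Stage ID": 1, "Task Metrics": {"Executor Run Time": 300, "Peak Execution Memory": 524288}}
"""


def peak_execution_memory_by_stage(lines):
    """Return {stage_id: largest per-task peak execution memory in bytes}.

    Assumes each line is one JSON event and that task-end events carry a
    "Task Metrics" object with a "Peak Execution Memory" entry.
    """
    peaks = defaultdict(int)
    for line in lines:
        line = line.strip()
        if not line:
            continue
        event = json.loads(line)
        # Only task-end events carry per-task metrics; skip everything else.
        if event.get("Event") != "SparkListenerTaskEnd":
            continue
        metrics = event.get("Task Metrics") or {}
        peak = metrics.get("Peak Execution Memory", 0)
        stage = event["Stage ID"]
        peaks[stage] = max(peaks[stage], peak)
    return dict(peaks)


result = peak_execution_memory_by_stage(SAMPLE_LOG.splitlines())
print(result)
```

Taking the maximum over tasks is one possible aggregation; summing across the tasks of a stage is an equally reasonable choice. Running the same analytic once per datatype (RDD, Dataset, DataFrame) and comparing these per-stage figures is the kind of comparison the abstract describes.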
