How Data Volume Affects Spark Based Data Analytics on a Scale-up Server
Ahsan Javed Awan, Mats Brorsson, Vladimir Vlassov, Eduard Ayguade

TL;DR
This paper investigates how increasing data volume impacts Spark-based data analytics on a scale-up server, revealing bottlenecks related to memory, I/O, and garbage collection that affect performance.
Contribution
It provides a detailed analysis of Spark performance on scale-up servers, highlighting the effects of data volume and proposing optimization strategies for memory management.
Findings
Spark-based analytics are DRAM bound and do not benefit from using more than 12 cores for an executor.
As input data grows, application performance degrades significantly because of longer wait times during I/O operations and garbage collection.
Matching an application's memory behavior to the garbage collector improves performance by 1.6x to 3x.
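To make the first and third findings concrete, here is a minimal sketch of how a single-JVM Spark run might be capped at 12 cores and pinned to a specific garbage collector. It only assembles a `spark-submit` command line and does not launch anything. The helper name `build_spark_submit_args`, the application file `analytics_job.py`, and the choice of `-XX:+UseParallelGC` are illustrative assumptions, not settings reported by the paper; the collector that suits a given workload would have to be determined by measurement.

```python
from typing import List


def build_spark_submit_args(
    app: str,
    cores: int = 12,
    gc_flag: str = "-XX:+UseParallelGC",  # assumed placeholder, not the paper's pick
) -> List[str]:
    """Assemble spark-submit arguments for a local-mode (single JVM) run.

    In local mode the tasks execute inside the driver JVM, so the JVM
    garbage collector flag is passed through --driver-java-options.
    """
    if cores < 1:
        raise ValueError("cores must be a positive integer")
    return [
        "spark-submit",
        "--master", f"local[{cores}]",       # limit the run to `cores` worker threads
        "--driver-java-options", gc_flag,    # select the JVM garbage collector
        app,
    ]


if __name__ == "__main__":
    # Hypothetical application file, used only for illustration.
    print(" ".join(build_spark_submit_args("analytics_job.py")))
```

Swapping `gc_flag` for another HotSpot collector flag such as `-XX:+UseG1GC` gives a simple way to compare collectors against the same workload and input size.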
Abstract
The sheer increase in the volume of data over the last decade has triggered research in cluster computing frameworks that enable web enterprises to extract big insights from big data. While Apache Spark is gaining popularity for exhibiting superior scale-out performance on commodity machines, the impact of data volume on the performance of Spark-based data analytics in a scale-up configuration is not well understood. We present a deep-dive analysis of Spark-based applications on a large scale-up server machine. Our analysis reveals that Spark-based data analytics are DRAM bound and do not benefit from using more than 12 cores for an executor. As the input data size grows, application performance degrades significantly due to a substantial increase in wait time during I/O operations and garbage collection, despite a 10% better instruction retirement rate (due to lower L1 cache misses and higher…
