Sparkle: Optimizing Spark for Large Memory Machines and Analytics
Mijung Kim, Jun Li, Haris Volos, Manish Marwah, Alexander Ulanov, Kimberly Keeton, Joseph Tucek, Lucy Cherkasova, Le Xu, and Pradeep Fernando

TL;DR
Sparkle enhances Spark's performance on large memory machines by replacing TCP/IP shuffle with shared memory and using off-heap storage, significantly accelerating iterative and graph workloads.
Contribution
It introduces a shared memory shuffle engine and off-heap memory store to optimize Spark's memory usage and performance on large memory systems.
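To make the shared memory shuffle idea concrete, here is a minimal single-machine sketch in Python. It is not Sparkle's implementation; it only illustrates the general technique of a map-side writer laying out partitioned records in a shared segment that a reduce-side reader attaches to by name, so no socket transfer is involved. All names (`write_partitions`, `read_partition`, `shuffle_and_reduce`), the fixed 16-byte record layout, and the modulo partitioner are assumptions made for illustration.

```python
import struct
from multiprocessing import shared_memory

# Hypothetical fixed-width record: (key, value) as two signed 64-bit ints.
RECORD = struct.Struct("<qq")


def write_partitions(records, num_partitions):
    """Map side: bucket records by key and write them into one shared segment.

    Returns (segment, index) where index[p] = (byte_offset, record_count).
    """
    buckets = [[] for _ in range(num_partitions)]
    for key, value in records:
        buckets[key % num_partitions].append((key, value))

    total = sum(len(b) for b in buckets)
    # A segment must have a positive size, even when there are no records.
    seg = shared_memory.SharedMemory(create=True, size=max(1, total * RECORD.size))
    index, offset = [], 0
    for bucket in buckets:
        index.append((offset, len(bucket)))
        for key, value in bucket:
            RECORD.pack_into(seg.buf, offset, key, value)
            offset += RECORD.size
    return seg, index


def read_partition(segment_name, index, partition):
    """Reduce side: attach to the segment by name and sum values per key."""
    seg = shared_memory.SharedMemory(name=segment_name)
    try:
        offset, count = index[partition]
        totals = {}
        for i in range(count):
            key, value = RECORD.unpack_from(seg.buf, offset + i * RECORD.size)
            totals[key] = totals.get(key, 0) + value
        return totals
    finally:
        seg.close()


def shuffle_and_reduce(records, num_partitions):
    """Run the map-side write, then every reduce-side read, then clean up."""
    seg, index = write_partitions(records, num_partitions)
    try:
        return [read_partition(seg.name, index, p) for p in range(num_partitions)]
    finally:
        seg.close()
        seg.unlink()
```

In this sketch the reader would normally run in a separate process and receive only the segment name and the small index; the record data itself stays in shared memory rather than being serialized over a network connection.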
Findings
Shared memory shuffle speeds up Spark by 1.3x to 6x.
Off-heap store yields over 20x improvement on graph workloads.
Performance gains are consistent across scale-up and scale-out environments.
Abstract
Spark is an in-memory analytics platform that targets commodity server environments today. It relies on the Hadoop Distributed File System (HDFS) to persist intermediate checkpoint states and final processing results. In Spark, immutable data are used for storing data updates in each iteration, making it inefficient for long running, iterative workloads. A non-deterministic garbage collector further worsens this problem. Sparkle is a library that optimizes memory usage in Spark. It exploits large shared memory to achieve better data shuffling and intermediate storage. Sparkle replaces the current TCP/IP-based shuffle with a shared memory approach and proposes an off-heap memory store for efficient updates. We performed a series of experiments on scale-out clusters and scale-up machines. The optimized shuffle engine leveraging shared memory provides 1.3x to 6x faster performance relative…
Taxonomy
Topics: Cloud Computing and Resource Management · Parallel Computing and Optimization Techniques · Advanced Data Storage Technologies
