Get Your Memory Right: The Crispy Resource Allocation Assistant for Large-Scale Data Processing

Jonathan Will; Lauritz Thamsen; Jonathan Bader; Dominik Scheinert; Odej Kao

arXiv:2206.13852 · cs.DC · January 11, 2023

TL;DR

Crispy is a resource allocation tool that efficiently predicts memory needs for large-scale data processing jobs using minimal profiling, significantly reducing costs without requiring extensive job history or full test runs.

Contribution

Crispy introduces a profiling-based approach to resource allocation that works on unique, non-recurring jobs with minimal overhead, unlike prior methods that assume a job recurs or that require full test runs.

Findings

01 Reduced job execution costs by 56%

02 Profiling takes less than ten minutes on average

03 Effective for diverse, non-recurring jobs

Abstract

Distributed dataflow systems like Apache Spark and Apache Hadoop enable data-parallel processing of large datasets on clusters. Yet, selecting appropriate computational resources for dataflow jobs -- that neither lead to bottlenecks nor to low resource utilization -- is often challenging, even for expert users such as data engineers. Further, existing automated approaches to resource selection rely on the assumption that a job is recurring to learn from previous runs or to warrant the cost of full test runs to learn from. However, this assumption often does not hold since many jobs are too unique. Therefore, we present Crispy, a method for optimizing data processing cluster configurations based on job profiling runs with small samples of the dataset on just a single machine. Crispy attempts to extrapolate the memory usage for the full dataset to then choose a cluster configuration…
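
The idea summarized in the abstract (profile small samples on one machine, extrapolate memory usage to the full dataset, then pick a cluster configuration) can be illustrated with a minimal sketch. Everything specific here is an assumption for illustration and is not taken from the paper or the dos-group/crispy repository: the linear memory model, the function names `extrapolate_memory_gb` and `choose_cluster`, the configuration fields, and the "cheapest configuration whose total memory covers the estimate" rule.

```python
import numpy as np


def extrapolate_memory_gb(sample_sizes_gb, peak_memory_gb, full_size_gb):
    """Estimate memory use at full dataset size from small profiling runs.

    Assumption: memory grows linearly with input size, so a degree-1
    least-squares fit is used. This is an illustrative model only.
    """
    slope, intercept = np.polyfit(sample_sizes_gb, peak_memory_gb, deg=1)
    return float(slope * full_size_gb + intercept)


def choose_cluster(configs, required_memory_gb):
    """Pick the cheapest configuration with enough total memory.

    Each config is a hypothetical dict with the keys "name", "nodes",
    "memory_per_node_gb", and "price_per_node_hour".
    Returns None when no configuration is large enough.
    """
    feasible = [
        c for c in configs
        if c["nodes"] * c["memory_per_node_gb"] >= required_memory_gb
    ]
    if not feasible:
        return None
    return min(feasible, key=lambda c: c["nodes"] * c["price_per_node_hour"])


# Hypothetical profiling results: input sample size vs. observed peak memory.
sample_sizes = [1.0, 2.0, 4.0]
peak_memory = [1.5, 2.5, 4.5]
estimate = extrapolate_memory_gb(sample_sizes, peak_memory, full_size_gb=100.0)

# Hypothetical candidate cluster configurations.
candidates = [
    {"name": "small", "nodes": 4, "memory_per_node_gb": 16, "price_per_node_hour": 0.25},
    {"name": "medium", "nodes": 4, "memory_per_node_gb": 32, "price_per_node_hour": 0.50},
    {"name": "large", "nodes": 8, "memory_per_node_gb": 32, "price_per_node_hour": 0.50},
]
selected = choose_cluster(candidates, estimate)
```

With these made-up numbers the fit is exact, so the estimate lands just above 100 GB and the sketch selects the "medium" configuration: it is the cheapest candidate whose total memory (128 GB) covers the estimate. A real job's memory profile may not be linear, in which case this simple fit would mislead.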

Code & Models

Repositories

dos-group/crispy (Official)

Taxonomy

Topics: Cloud Computing and Resource Management · Data Stream Mining Techniques · IoT and Edge/Fog Computing