Get Your Memory Right: The Crispy Resource Allocation Assistant for Large-Scale Data Processing
Jonathan Will, Lauritz Thamsen, Jonathan Bader, Dominik Scheinert, Odej Kao

TL;DR
Crispy is a resource allocation tool that efficiently predicts memory needs for large-scale data processing jobs using minimal profiling, significantly reducing costs without requiring extensive job history or full test runs.
Contribution
Crispy introduces a profiling-based approach to resource allocation that works on unique, non-recurring jobs with minimal overhead, unlike prior methods that assume a job recurs or that require full test runs.
Findings
Reduced job execution costs by 56%
Profiling takes less than ten minutes on average
Effective for diverse, non-recurring jobs
Abstract
Distributed dataflow systems like Apache Spark and Apache Hadoop enable data-parallel processing of large datasets on clusters. Yet, selecting appropriate computational resources for dataflow jobs -- that neither lead to bottlenecks nor to low resource utilization -- is often challenging, even for expert users such as data engineers. Further, existing automated approaches to resource selection rely on the assumption that a job is recurring to learn from previous runs or to warrant the cost of full test runs to learn from. However, this assumption often does not hold since many jobs are too unique. Therefore, we present Crispy, a method for optimizing data processing cluster configurations based on job profiling runs with small samples of the dataset on just a single machine. Crispy attempts to extrapolate the memory usage for the full dataset to then choose a cluster configuration…
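The extrapolation step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes that peak memory grows roughly linearly with input size, fits a least-squares line to a few profiling measurements, and then picks the cheapest configuration from a hypothetical catalogue whose total memory covers the estimate. All names (`fit_line`, `extrapolate_memory_gb`, `ClusterConfig`, `choose_config`) and the use of hourly price as the cost measure are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Optional, Sequence, Tuple


def fit_line(xs: Sequence[float], ys: Sequence[float]) -> Tuple[float, float]:
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    if n < 2 or n != len(ys):
        raise ValueError("need at least two (x, y) pairs of equal length")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("sample sizes must not all be identical")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def extrapolate_memory_gb(
    sample_sizes_gb: Sequence[float],
    peak_memory_gb: Sequence[float],
    full_size_gb: float,
) -> float:
    """Estimate peak memory for the full dataset (assumes linear growth)."""
    slope, intercept = fit_line(sample_sizes_gb, peak_memory_gb)
    return slope * full_size_gb + intercept


@dataclass(frozen=True)
class ClusterConfig:
    """Hypothetical catalogue entry; fields are illustrative assumptions."""
    name: str
    nodes: int
    memory_per_node_gb: float
    hourly_cost_per_node: float

    @property
    def total_memory_gb(self) -> float:
        return self.nodes * self.memory_per_node_gb

    @property
    def hourly_cost(self) -> float:
        return self.nodes * self.hourly_cost_per_node


def choose_config(
    configs: Sequence[ClusterConfig], required_memory_gb: float
) -> Optional[ClusterConfig]:
    """Cheapest configuration (by hourly price) whose memory fits the estimate."""
    feasible = [c for c in configs if c.total_memory_gb >= required_memory_gb]
    return min(feasible, key=lambda c: c.hourly_cost, default=None)
```

A caller would pass the sample sizes and measured peak memory from the single-machine profiling runs to `extrapolate_memory_gb`, then hand the estimate to `choose_config`. Ranking by hourly price ignores runtime differences between configurations, so it is a simplification of actual job execution cost.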
Taxonomy
Topics: Cloud Computing and Resource Management · Data Stream Mining Techniques · IoT and Edge/Fog Computing
