RHEEMix in the Data Jungle: A Cost-based Optimizer for Cross-platform Systems
Sebastian Kruse, Zoi Kaoudi, Bertty Contreras, Sanjay Chawla, Felix Naumann, Jorge-Arnulfo Quiané-Ruiz

TL;DR
This paper introduces Rheem's cost-based optimizer that intelligently assigns data analytic subtasks to the most suitable platforms, significantly improving efficiency and scalability in cross-platform data analytics.
Contribution
It presents a novel graph-based optimization approach with efficient plan enumeration for cross-platform data processing, enabling better platform selection and faster task execution.
Findings
The optimizer can select the most efficient platform combination.
Tasks run more than one order of magnitude faster when multiple platforms are combined.
Extensive evaluation shows significant performance improvements.
Abstract
In pursuit of efficient and scalable data analytics, the insight that "one size does not fit all" has given rise to a plethora of specialized data processing platforms, and today's complex data analytics are moving beyond the limits of a single platform. In this paper, we present the cost-based optimizer of Rheem, an open-source cross-platform system that copes with these new requirements. The optimizer allocates the subtasks of data analytic tasks to the most suitable platforms. Our main contributions are: (i) a mechanism based on graph transformations to explore alternative execution strategies; (ii) a novel graph-based approach to efficiently plan data movement among subtasks and platforms; and (iii) an efficient plan enumeration algorithm, based on a novel enumeration algebra. We extensively evaluate our optimizer under diverse real tasks. The results show that our optimizer is…
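To make the idea of cost-based platform allocation concrete, here is a minimal sketch in Python. It is not Rheem's algorithm: it assumes a purely linear pipeline of operators and uses plain dynamic programming in place of the paper's graph transformations and enumeration algebra. The function name assign_platforms, the platform names, and all cost figures are hypothetical and chosen only for illustration.

```python
from typing import Dict, List, Tuple


def assign_platforms(
    op_costs: List[Dict[str, float]],
    move_cost: Dict[Tuple[str, str], float],
) -> Tuple[float, List[str]]:
    """Choose one platform per operator in a linear pipeline so that
    execution cost plus data-movement cost is minimal (toy model)."""
    # best[p] = (cheapest total cost of a partial plan ending on p, that plan)
    best = {p: (c, [p]) for p, c in op_costs[0].items()}
    for costs in op_costs[1:]:
        nxt = {}
        for p, c in costs.items():
            candidates = []
            for q, (prev_total, prev_plan) in best.items():
                # Moving data is free when consecutive operators share a platform.
                move = 0.0 if q == p else move_cost[(q, p)]
                candidates.append((prev_total + move + c, prev_plan + [p]))
            nxt[p] = min(candidates, key=lambda t: t[0])
        best = nxt
    return min(best.values(), key=lambda t: t[0])


# Hypothetical estimated costs per operator and platform (arbitrary units).
op_costs = [
    {"java": 1.0, "spark": 5.0},   # small parse step
    {"java": 50.0, "spark": 8.0},  # heavy join
    {"java": 2.0, "spark": 6.0},   # small final aggregate
]
# Hypothetical cost of shipping intermediate data between platforms.
move_cost = {("java", "spark"): 3.0, ("spark", "java"): 3.0}

total, plan = assign_platforms(op_costs, move_cost)
print(total, plan)
```

With these made-up numbers the mixed plan (java, spark, java) costs 17, compared with 53 for running everything on the single-node platform and 19 for running everything on the distributed one, which illustrates why an optimizer has to weigh data movement against per-platform execution cost.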
