Opening the Black Boxes in Data Flow Optimization
Fabian Hueske, Mathias Peters, Matthias Sax, Astrid Rheinländer, Rico Bergmann, Aljoscha Krettek, Kostas Tzoumas

TL;DR
This paper presents a data flow optimizer that reorders operators without knowing their semantics. It statically analyzes the code of their user-defined functions to derive the properties needed for reordering in big data systems.
Contribution
It introduces a method for data flow optimization that relies on a handful of operator properties, enabling reordering without a full algebraic specification of the operators.
Findings
The optimizer can reorder operators like selection and join in black box data flows.
It achieves rewriting power similar to that of relational DBMS optimizers.
It can optimize non-relational data flows, a unique capability.
Abstract
Many systems for big data analytics employ a data flow abstraction to define parallel data processing tasks. In this setting, custom operations expressed as user-defined functions are very common. We address the problem of performing data flow optimization at this level of abstraction, where the semantics of operators are not known. Traditionally, query optimization is applied to queries with known algebraic semantics. In this work, we find that a handful of properties, rather than a full algebraic specification, suffice to establish reordering conditions for data processing operators. We show that these properties can be accurately estimated for black box operators by statically analyzing the general-purpose code of their user-defined functions. We design and implement an optimizer for parallel data flows that does not assume knowledge of semantics or algebraic properties of operators.…
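To make the idea of "a handful of properties" concrete, here is a minimal sketch of a reordering check. It assumes, purely for illustration, that the properties in question are the sets of record fields each operator reads and writes; the abstract above does not list the properties, and the names `OpProps`, `can_swap`, and the example operators are hypothetical rather than taken from the paper's implementation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class OpProps:
    """Hypothetical summary of a black box operator (illustrative only)."""
    name: str
    reads: frozenset   # fields the user-defined function may read
    writes: frozenset  # fields the user-defined function may modify


def can_swap(a: OpProps, b: OpProps) -> bool:
    """Conservative check for two adjacent record-at-a-time operators.

    The swap is allowed only if neither operator writes a field that the
    other one reads or writes. This is an assumed condition for the sketch,
    not the paper's exact rule.
    """
    return (a.writes.isdisjoint(b.reads)
            and b.writes.isdisjoint(a.reads)
            and a.writes.isdisjoint(b.writes))


# Example operators; in practice the field sets would come from code analysis.
filter_by_price = OpProps("filter_by_price", frozenset({"price"}), frozenset())
add_tax = OpProps("add_tax", frozenset({"price"}), frozenset({"price"}))
tag_country = OpProps("tag_country", frozenset({"ip"}), frozenset({"country"}))

print(can_swap(filter_by_price, tag_country))  # True: no shared fields
print(can_swap(filter_by_price, add_tax))      # False: add_tax rewrites "price"
```

The check is deliberately conservative: if static analysis overestimates a read or write set, the optimizer only loses a reordering opportunity and never produces a wrong plan.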
Taxonomy
Topics: Cloud Computing and Resource Management · Advanced Data Storage Technologies · Advanced Database Systems and Queries
