SOFA: An Extensible Logical Optimizer for UDF-heavy Dataflows
Astrid Rheinländer, Arvid Heise, Fabian Hueske, Ulf Leser, Felix Naumann

TL;DR
SOFA is an extensible optimizer designed for UDF-heavy dataflows, using semantic properties and rewrite rules to generate more efficient execution plans, significantly outperforming existing algorithms.
Contribution
It introduces a novel, extensible optimization framework that effectively handles UDF-heavy dataflows through semantic properties and a subsumption hierarchy.
Findings
SOFA finds plans up to 6 times more efficient than the best plans found by competing optimizers.
It outperforms three other optimization algorithms.
The approach is effective across multiple dataflow domains.
Abstract
Recent years have seen an increased interest in large-scale analytical dataflows on non-relational data. These dataflows are compiled into execution graphs scheduled on large compute clusters. In many novel application areas the predominant building blocks of such dataflows are user-defined predicates or functions (UDFs). However, the heavy use of UDFs is not well taken into account for dataflow optimization in current systems. SOFA is a novel and extensible optimizer for UDF-heavy dataflows. It builds on a concise set of properties for describing the semantics of Map/Reduce-style UDFs and a small set of rewrite rules, which use these properties to find a much larger number of semantically equivalent plan rewrites than possible with traditional techniques. A salient feature of our approach is extensibility: We arrange user-defined operators and their properties into a subsumption…
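To make the idea of property-driven rewrites concrete, here is a minimal sketch in Python. It is not SOFA's actual property set, rule language, or API. Every name in it (Operator, can_swap, equivalent_plans, the example operators and field names) is hypothetical. It assumes deterministic, record-at-a-time operators annotated with the fields they read and write, lets an operator inherit annotations from a parent in a toy subsumption hierarchy, and applies a single reordering rule: two adjacent operators may be swapped when neither writes a field the other reads or writes.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Operator:
    """Hypothetical UDF wrapper annotated with read/write field sets."""
    name: str
    reads: frozenset = frozenset()
    writes: frozenset = frozenset()
    parent: Optional["Operator"] = None  # toy subsumption link (assumption)

    def effective_reads(self) -> frozenset:
        # Properties of the parent operator are inherited by the child.
        inherited = self.parent.effective_reads() if self.parent else frozenset()
        return self.reads | inherited

    def effective_writes(self) -> frozenset:
        inherited = self.parent.effective_writes() if self.parent else frozenset()
        return self.writes | inherited


def can_swap(a: Operator, b: Operator) -> bool:
    """Reordering rule: no read/write or write/write conflict between a and b."""
    a_r, a_w = a.effective_reads(), a.effective_writes()
    b_r, b_w = b.effective_reads(), b.effective_writes()
    return not (a_w & b_r) and not (b_w & a_r) and not (a_w & b_w)


def equivalent_plans(plan):
    """Enumerate linear plans reachable by swapping adjacent commuting operators."""
    start = tuple(plan)
    seen = {start}
    frontier = [start]
    while frontier:
        current = frontier.pop()
        for i in range(len(current) - 1):
            if can_swap(current[i], current[i + 1]):
                swapped = current[:i] + (current[i + 1], current[i]) + current[i + 2:]
                if swapped not in seen:
                    seen.add(swapped)
                    frontier.append(swapped)
    return seen


# Hypothetical text-analytics pipeline.
text_annotator = Operator("text_annotator", reads=frozenset({"text"}))
tokenize = Operator("tokenize", writes=frozenset({"tokens"}), parent=text_annotator)
tag_entities = Operator(
    "tag_entities",
    reads=frozenset({"tokens"}),
    writes=frozenset({"entities"}),
    parent=text_annotator,
)
filter_english = Operator("filter_english", reads=frozenset({"lang"}))

plans = equivalent_plans([tokenize, tag_entities, filter_english])
plan_names = {tuple(op.name for op in p) for p in plans}
for names in sorted(plan_names):
    print(" -> ".join(names))
```

In this toy pipeline the filter can move ahead of both annotators because it touches only the lang field, while tokenize always stays before tag_entities because of the shared tokens field. A real optimizer would also need a cost model to choose among the enumerated plans; that part is omitted here.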
Taxonomy
Topics: Advanced Database Systems and Queries · Cloud Computing and Resource Management · Advanced Data Storage Technologies
