Optimal Predicate Pushdown Synthesis
Robert Zhang, Eric Hayden Campbell, Dixin Tang, Isil Dillig

TL;DR
This paper presents a semantic foundation and an automated synthesis framework for predicate pushdown in data pipelines. The synthesized transformations move filters ahead of expensive user-defined functions, speeding up pipelines by an average of 2.4×.
Contribution
It introduces a formal semantic basis and a synthesis algorithm for optimal predicate pushdown, implemented in the Pusharoo tool, applicable to real-world data processing pipelines.
Findings
Pusharoo produces optimal pushdown transformations in a median of 1.6 seconds.
Discovered pushdowns speed up pipelines by an average of 2.4×.
The approach is more expressive than prior work, handling complex UDFs.
Abstract
Predicate pushdown is a long-standing performance optimization that filters data as early as possible in a computational workflow. In modern data pipelines, this transformation is especially important because much of the computation occurs inside user-defined functions (UDFs) written in general-purpose languages such as Python and Scala. These UDFs capture rich domain logic and complex aggregations and are among the most expensive operations in a pipeline. Moving filters ahead of such UDFs can yield substantial performance gains, but doing so requires semantic reasoning. This paper introduces a general semantic foundation for predicate pushdown over stateful fold-based computations. We view pushdown as a correspondence between two programs that process different subsets of input data, with correctness witnessed by a bisimulation invariant relating their internal states. Building on…
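The correspondence described in the abstract can be illustrated with a small Python sketch. It is not Pusharoo's algorithm or API. The fold step, the key-based filter, and every name below are hypothetical and chosen only for illustration. The original program folds over all rows and filters the result, the pushed-down program filters rows first and then folds, and the invariant states that the pushed-down state equals the original state restricted to the wanted keys.

```python
from functools import reduce

# Hypothetical fold-based UDF: accumulate a running total per key.
def step(state, row):
    key, value = row
    new_state = dict(state)
    new_state[key] = new_state.get(key, 0) + value
    return new_state

# Filter applied to the fold's result: keep only the wanted keys.
def post_filter(state, wanted):
    return {k: v for k, v in state.items() if k in wanted}

# Original pipeline: fold over every row, then filter the result.
def original(rows, wanted):
    return post_filter(reduce(step, rows, {}), wanted)

# Pushed-down pipeline: filter the input rows, then fold.
def pushed_down(rows, wanted):
    kept = [row for row in rows if row[0] in wanted]
    return reduce(step, kept, {})

# Assumed invariant relating the two internal states.
def invariant(s_orig, s_push, wanted):
    return post_filter(s_orig, wanted) == s_push

# Run both programs in lockstep and check the invariant after each row.
def check_lockstep(rows, wanted):
    s_orig, s_push = {}, {}
    for row in rows:
        s_orig = step(s_orig, row)
        if row[0] in wanted:
            s_push = step(s_push, row)
        if not invariant(s_orig, s_push, wanted):
            return False
    return True

rows = [("a", 1), ("b", 5), ("a", 2), ("c", 7)]
wanted = {"a"}
print(original(rows, wanted) == pushed_down(rows, wanted))
print(check_lockstep(rows, wanted))
```

Checking the invariant on one input is only a test, not a proof. The paper's framework establishes such invariants semantically, which this sketch does not attempt.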
