
TL;DR
The paper proposes partial partial aggregates (PPA), a query optimization technique for distributed engines that reduces data shuffling by pushing only the compute phase of aggregation through joins, leveraging the distributive property.
Contribution
Introducing PPA as a novel optimization that selectively pushes compute aggregation, improving efficiency in distributed query processing with specific join and key configurations.
Findings
PPA reduces unnecessary data shuffles in distributed queries.
Full pushed aggregates can eliminate top-level aggregation in certain join scenarios.
Accurate NDV estimation is crucial for PPA's cost-based decision making.
Abstract
We introduce partial partial aggregates (PPA), a query optimization technique for distributed engines that pushes only the local compute phase of an aggregate operation through joins. A query that aggregates after a join involves two logical operations, each requiring a network shuffle. Pushing a full aggregate (COMPUTEDISTRIBUTEMERGE) below the join introduces a third shuffle. In the specific case where the join key is included in the grouping key and the join is FK-PK, the full pushed aggregate can eliminate the top-level aggregate entirely, making it the preferred choice. In all other key configurations, the top aggregate must remain, and the extra shuffle is wasteful. A PPA pushes only COMPUTE, achieving data reduction before the join without the extra shuffle. The technique relies on the distributive property of aggregates and requires accurate NDV…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
