Efficient sorting, duplicate removal, grouping, and aggregation
Thanh Do, Goetz Graefe, Jeffrey Naughton

TL;DR
This paper introduces a new sort-based algorithm for duplicate removal, grouping, and aggregation that performs at least as well as traditional methods regardless of input size or the need for sorted output, and that is used in Google's production systems.
Contribution
The paper presents a novel algorithm that always performs at least as well as existing methods and can replace multiple algorithms in database query processing.
Findings
The new algorithm performs at least as well as traditional methods in all scenarios.
Used in Google's production workloads for large-scale data aggregation.
Produces sorted output to accelerate subsequent database operations.
Abstract
Database query processing requires algorithms for duplicate removal, grouping, and aggregation. Three algorithms exist: in-stream aggregation is most efficient by far but requires sorted input; sort-based aggregation relies on external merge sort; and hash aggregation relies on an in-memory hash table plus hash partitioning to temporary storage. Cost-based query optimization chooses which algorithm to use based on several factors including the sort order of the input, input and output sizes, and the need for sorted output. For example, hash-based aggregation is ideal for output smaller than the available memory (e.g., TPC-H Query 1), whereas sorting the entire input and aggregating after sorting are preferable when both aggregation input and output are large and the output needs to be sorted for a subsequent operation such as a merge join. Unfortunately, the size information required…
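To make the distinction between the three existing algorithms concrete, here is a minimal Python sketch of each, summing a value per grouping key. It is an illustration only, not the paper's new algorithm: the function names are hypothetical, the in-memory `sorted()` call stands in for an external merge sort, and the hash variant omits partitioning to temporary storage, so it assumes the output fits in memory.

```python
from itertools import groupby
from operator import itemgetter


def in_stream_aggregate(sorted_rows):
    """Sum values per key in one pass; rows must already be sorted by key."""
    for key, group in groupby(sorted_rows, key=itemgetter(0)):
        yield key, sum(value for _, value in group)


def sort_based_aggregate(rows):
    """Sort by key, then aggregate in-stream.

    Assumption: sorted() replaces the external merge sort a real system
    would use when the input exceeds memory.
    """
    return list(in_stream_aggregate(sorted(rows, key=itemgetter(0))))


def hash_aggregate(rows):
    """Accumulate sums in an in-memory hash table.

    Assumption: all groups fit in memory, so no hash partitioning
    to temporary storage is modeled here.
    """
    totals = {}
    for key, value in rows:
        totals[key] = totals.get(key, 0) + value
    return totals


rows = [("b", 2), ("a", 1), ("b", 3), ("a", 4), ("c", 5)]
sorted_result = sort_based_aggregate(rows)  # output ordered by key
hashed_result = hash_aggregate(rows)        # no ordering guarantee by key
```

In this sketch only the sort-based path yields output ordered by key, which is the property a subsequent merge join can exploit; the hash path avoids the sort but gives no such ordering.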
Taxonomy
Topics: Advanced Database Systems and Queries · Data Management and Algorithms · Cloud Computing and Resource Management
