Scaling and Load-Balancing Equi-Joins
Ahmed Metwally

TL;DR
This paper introduces AM-Join, a scalable and load-balanced algorithm for equi-joins in distributed systems, effectively handling hot keys and supporting all outer-join variants with improved performance over existing methods.
Contribution
It proposes a novel Adaptive-Multistage-Join (AM-Join) algorithm that holistically addresses join-skew, supports all outer-join variants, and introduces the Index-Broadcast-Join family for efficient large-table joins.
Findings
AM-Join achieves load balancing during join execution.
The algorithms outperform state-of-the-art in speed and scalability.
Effective handling of skewed data and large tables in distributed environments.
Abstract
The task of joining two tables is fundamental for querying databases. In this paper, we focus on the equi-join problem, where a pair of records from the two joined tables are part of the join results if equality holds between their values in the join column(s). While this is a tractable problem when the number of records in the joined tables is relatively small, it becomes very challenging as the table sizes increase, especially if hot keys (join column values with a large number of records) exist in both joined tables. This paper, an extended version of [metwally-SIGMOD-2022], proposes Adaptive-Multistage-Join (AM-Join) for scalable and fast equi-joins in distributed shared-nothing architectures. AM-Join utilizes (a) Tree-Join, a proposed novel algorithm that scales well when the joined tables share hot keys, and (b) Broadcast-Join, the known fastest when joining keys that are hot in…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsData Quality and Management · Data Management and Algorithms · Caching and Content Delivery
