Approximate Distributed Joins in Apache Spark
Do Le Quoc, Istemi Ekin Akkus, Pramod Bhatotia, Spyros Blanas, Ruichuan Chen, Christof Fetzer, Thorsten Strufe

TL;DR
ApproxJoin is an approximate join operator for Apache Spark that combines Bloom filter sketching with stratified sampling to reduce data movement and computation time while preserving the statistical properties of the join output.
Contribution
The paper introduces ApproxJoin, an operator that approximates distributed joins by interweaving Bloom filter sketching and stratified sampling with the join computation, so that the statistical properties of the join output are preserved.
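The two ingredients named above can be illustrated with a minimal single-machine sketch in plain Python. This is not the paper's implementation and does not use Spark. The class `BloomFilter`, the functions `approx_join` and `estimate_sum`, the `fraction` parameter, and the choice to sample pairs with replacement are all assumptions made for illustration. The sketch filters out rows whose keys cannot join, then treats each join key as a stratum and samples pairs from that key's cross product.

```python
import hashlib
import random
from collections import defaultdict


class BloomFilter:
    """Tiny Bloom filter over a fixed-size bit array (illustrative only)."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = [False] * num_bits

    def _positions(self, key):
        # Derive num_hashes bit positions by salting a SHA-256 digest.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode()).hexdigest()
            yield int(digest, 16) % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos] = True

    def __contains__(self, key):
        return all(self.bits[pos] for pos in self._positions(key))

    def intersect(self, other):
        # Bitwise AND of two filters built with the same parameters.
        # Keys present in both inputs always pass; false positives are possible.
        out = BloomFilter(self.num_bits, self.num_hashes)
        out.bits = [a and b for a, b in zip(self.bits, other.bits)]
        return out


def approx_join(left, right, fraction, seed=0):
    """Return {key: (sampled_pairs, cross_product_size)} for joinable keys.

    left and right are lists of (key, value) rows. fraction is a hypothetical
    per-stratum sampling rate, not a parameter taken from the paper.
    """
    rng = random.Random(seed)

    # Step 1: sketch each input and intersect the sketches.
    bf_left, bf_right = BloomFilter(), BloomFilter()
    for key, _ in left:
        bf_left.add(key)
    for key, _ in right:
        bf_right.add(key)
    join_filter = bf_left.intersect(bf_right)

    # Step 2: drop rows whose key cannot take part in the join.
    left_by_key, right_by_key = defaultdict(list), defaultdict(list)
    for key, value in left:
        if key in join_filter:
            left_by_key[key].append(value)
    for key, value in right:
        if key in join_filter:
            right_by_key[key].append(value)

    # Step 3: each join key is a stratum; sample pairs with replacement
    # from that key's cross product instead of materializing it.
    strata = {}
    for key in left_by_key.keys() & right_by_key.keys():
        lv, rv = left_by_key[key], right_by_key[key]
        population = len(lv) * len(rv)
        sample_size = max(1, round(fraction * population))
        pairs = [(rng.choice(lv), rng.choice(rv)) for _ in range(sample_size)]
        strata[key] = (pairs, population)
    return strata


def estimate_sum(strata, combine):
    """Estimate the sum of combine(l, r) over the full join output."""
    total = 0.0
    for pairs, population in strata.values():
        mean = sum(combine(l, r) for l, r in pairs) / len(pairs)
        total += population * mean  # scale the stratum mean to its true size
    return total


left = [("a", 1), ("a", 1), ("b", 1), ("c", 1)]
right = [("a", 2), ("b", 2), ("b", 2), ("d", 2)]
strata = approx_join(left, right, fraction=0.5)
estimate = estimate_sum(strata, lambda l, r: l * r)
```

Because every join key keeps its own stratum, rare keys are not lost the way they can be when each input is sampled independently before joining. Scaling each stratum's sample mean by the size of its cross product gives the aggregate estimate.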
Findings
Achieves 6-9x speedup over standard Spark joins.
Reduces shuffled data volume by 5-82x.
Maintains tight error bounds on join output accuracy.
Abstract
The join operation is a fundamental building block of parallel data processing. Unfortunately, it is very resource-intensive to compute an equi-join across massive datasets. The approximate computing paradigm allows users to trade accuracy and latency for expensive data processing operations. The equi-join operator is thus a natural candidate for optimization using approximation techniques. Although sampling-based approaches are widely used for approximation, sampling over joins is a compelling but challenging task regarding the output quality. Naive approaches, which perform joins over dataset samples, would not preserve statistical properties of the join output. To realize this potential, we interweave Bloom filter sketching and stratified sampling with the join computation in a new operator, ApproxJoin, that preserves the statistical properties of the join output. ApproxJoin…
Taxonomy
Topics: Advanced Database Systems and Queries · Data Management and Algorithms · Distributed systems and fault tolerance
