Accelerating Approximate Analytical Join Queries over Unstructured Data with Statistical Guarantees
Yuxuan Zhu, Tengjun Jin, Chenghao Mo, Daniel Kang

TL;DR
This paper introduces blocking-augmented sampling (BaS), a method that combines embedding-based blocking with sampling to answer approximate join queries over unstructured data efficiently and with statistical guarantees.
Contribution
BaS optimally combines embedding-based blocking and sampling, providing statistical guarantees and improved efficiency for approximate join queries.
Findings
BaS achieves up to 19× reduction in estimation error.
BaS provides valid confidence intervals on real-world datasets.
BaS asymptotically outperforms or matches standalone sampling.
Abstract
Analytical join queries over unstructured data are increasingly prevalent in data analytics. Applying machine learning (ML) models to label every pair in the cross product of tables can achieve state-of-the-art accuracy, but the cost of pairwise execution of ML models is prohibitive. Existing algorithms, such as embedding-based blocking and sampling, aim to reduce this cost. However, they either fail to provide statistical guarantees (leading to errors up to 79% higher than expected) or become as inefficient as uniform sampling. We propose blocking-augmented sampling (BaS), which simultaneously achieves statistical guarantees and high efficiency. BaS optimally orchestrates embedding-based blocking and sampling to mitigate their respective limitations. Specifically, BaS allocates data tuples in the cross product into two regimes based on the failure modes of embeddings. In the regime…
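The general idea of pairing blocking with sampling can be illustrated with a toy count query. The sketch below is not the authors' algorithm: it assumes a simple two-regime split in which pairs above a similarity threshold are labelled exhaustively and the remaining pairs are estimated from a uniform sample. The names `estimate_join_count`, `similarity`, `oracle`, and `block_threshold` are hypothetical, and the interval is a plain normal approximation rather than the guarantee the paper derives.

```python
import math
import random


def estimate_join_count(left, right, similarity, oracle,
                        block_threshold, sample_size, seed=0):
    """Estimate how many pairs in left x right satisfy `oracle`.

    Hypothetical two-regime sketch (an assumption, not the paper's method):
      * blocking regime: pairs with similarity >= block_threshold are all
        labelled by the expensive oracle, giving an exact partial count;
      * sampling regime: the remaining pairs are estimated from a uniform
        sample drawn without replacement.
    Returns (estimate, half_width) where half_width is a 95% normal
    approximation interval half-width for the sampled part.
    """
    rng = random.Random(seed)
    blocked, rest = [], []
    for a in left:
        for b in right:
            # The similarity score is assumed to be cheap compared to the oracle.
            target = blocked if similarity(a, b) >= block_threshold else rest
            target.append((a, b))

    # Blocking regime: exact count, contributes no sampling variance.
    blocked_count = sum(1 for a, b in blocked if oracle(a, b))

    # Sampling regime: scale the sample hit rate up to the whole remainder.
    n = min(sample_size, len(rest))
    if n == 0:
        return float(blocked_count), 0.0
    sample = rng.sample(rest, n)
    hits = sum(1 for a, b in sample if oracle(a, b))
    p_hat = hits / n
    rest_estimate = p_hat * len(rest)

    # Variance of the sample proportion with finite population correction.
    fpc = 1.0 - n / len(rest)
    variance = p_hat * (1.0 - p_hat) / max(n - 1, 1) * fpc
    half_width = 1.96 * len(rest) * math.sqrt(variance)
    return blocked_count + rest_estimate, half_width
```

When the similarity score separates matches well, most true matches fall in the blocking regime and the sampled remainder contributes little variance. When it does not, the estimate degrades toward plain uniform sampling, which is the trade-off the abstract describes.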
Taxonomy
Topics: Data Quality and Management · Advanced Database Systems and Queries · Cloud Computing and Resource Management
