Correlation Sketches for Approximate Join-Correlation Queries
Aécio Santos, Aline Bessa, Fernando Chirigati, Christopher Musco, Juliana Freire

TL;DR
This paper introduces a novel sketching technique and scoring strategies to efficiently identify joinable tables with columns correlated to a query column, enabling scalable data augmentation for analytics and machine learning.
Contribution
It proposes a new sketching method for index construction and correlation estimation, improving the efficiency of join-correlation queries over large datasets.
Findings
Sketches achieve high accuracy in correlation estimation.
Scoring strategies effectively rank tables by correlation quality.
The method scales well with dataset size.
Abstract
The increasing availability of structured datasets, from Web tables and open-data portals to enterprise data, opens up opportunities to enrich analytics and improve machine learning models through relational data augmentation. In this paper, we introduce a new class of data augmentation queries: join-correlation queries. Given a column Q and a join column K_Q from a query table T_Q, retrieve tables T_X in a dataset collection such that T_X is joinable with T_Q on K_Q and there is a column C ∈ T_X such that Q is correlated with C. A naïve approach to evaluate these queries, which first finds joinable tables and then explicitly joins and computes correlations between Q and all columns of the discovered tables, is prohibitively expensive. To efficiently support correlated column discovery, we 1) propose a sketching…
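To make the contrast in the abstract concrete, the following is a minimal sketch, not the paper's implementation. It compares the exact baseline (join on the key, then compute Pearson correlation) with an estimate computed from small hash-based samples of each column. Every name here (`pearson`, `naive_join_correlation`, `key_hash`, `build_sketch`, `sketch_correlation`) is hypothetical. The sampling rule (keep the entries whose keys have the smallest hash values) and the choice of BLAKE2b as the hash are assumptions made for illustration, and each column is assumed to be a `{join_key: numeric_value}` mapping with unique keys.

```python
import hashlib
import math


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences; NaN if undefined."""
    n = len(xs)
    if n < 2:
        return float("nan")
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return float("nan")
    return sxy / math.sqrt(sxx * syy)


def naive_join_correlation(query, candidate):
    """Exact baseline: join two {key: value} columns on key, then correlate."""
    keys = sorted(query.keys() & candidate.keys())
    return pearson([query[k] for k in keys], [candidate[k] for k in keys])


def key_hash(key):
    """Deterministic 64-bit hash of a join key (assumed hash choice)."""
    digest = hashlib.blake2b(str(key).encode("utf-8"), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def build_sketch(column, size):
    """Keep the `size` entries whose keys have the smallest hash values.

    Because every column uses the same hash, keys shared by two columns
    tend to be selected in both sketches, so the sketches stay joinable.
    """
    hashed = sorted(((key_hash(k), v) for k, v in column.items()),
                    key=lambda pair: pair[0])
    return dict(hashed[:size])


def sketch_correlation(sketch_q, sketch_c):
    """Estimate correlation from the hashed keys present in both sketches."""
    shared = sorted(sketch_q.keys() & sketch_c.keys())
    return pearson([sketch_q[h] for h in shared], [sketch_c[h] for h in shared])


if __name__ == "__main__":
    # Synthetic columns that overlap on keys 500..999 and are linearly related.
    query = {i: float(i) for i in range(1000)}
    candidate = {i: 2.0 * i + 1.0 for i in range(500, 1500)}
    print("exact:   ", naive_join_correlation(query, candidate))
    print("estimate:", sketch_correlation(build_sketch(query, 256),
                                          build_sketch(candidate, 256)))
```

In this toy setup the candidate values are an exact linear function of the query values, so both the exact result and the sketch-based estimate come out at 1 up to floating-point rounding. With noisy data the estimate would vary with the sketch size, and the sketches can be built once per column and reused across queries, which is what makes an index over a large collection plausible.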
