QJoin: Transformation-aware Joinable Data Discovery Using Reinforcement Learning
Ning Wang, Sainyam Galhotra

TL;DR
QJoin is a reinforcement learning framework that improves joinable data discovery by learning transformation strategies and reusing them, leading to higher accuracy and efficiency in heterogeneous data repositories.
Contribution
It introduces a reinforcement learning approach with transfer and reuse mechanisms for transformation-aware join discovery in large, heterogeneous datasets.
Findings
Achieves a 91.0% F1-score on the AutoJoin Web benchmark.
Reduces join task runtime by up to 7.4%.
Effectively reuses learned transformations for similar tasks.
Abstract
Discovering which tables in large, heterogeneous repositories can be joined and by what transformations is a central challenge in data integration and data discovery. Traditional join discovery methods are largely designed for equi-joins, which assume that join keys match exactly or nearly so. These techniques, while efficient in clean, well-normalized databases, fail in open or federated settings where identifiers are inconsistently formatted, embedded, or split across multiple columns. Approximate or fuzzy joins alleviate minor string variations but cannot capture systematic transformations. We introduce QJoin, a reinforcement-learning framework that learns and reuses transformation strategies across join tasks. QJoin trains an agent under a uniqueness-aware reward that balances similarity with key distinctiveness, enabling it to explore concise, high-value transformation chains. To…
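To make the idea of a uniqueness-aware reward concrete, here is a minimal Python sketch. It is not the reward used in QJoin, whose exact form is not given in this summary. It assumes a reward defined as the product of containment (the fraction of transformed source keys found in the target column) and uniqueness (the fraction of distinct transformed keys). All names here (`apply_chain`, `containment`, `uniqueness`, `reward`, and the sample transformations) are hypothetical and chosen only for illustration.

```python
from typing import Callable, List

# A transformation maps one key string to another; a chain applies several in order.
Transform = Callable[[str], str]


def apply_chain(values: List[str], chain: List[Transform]) -> List[str]:
    """Apply every transformation in `chain`, in order, to each value."""
    out = []
    for v in values:
        for step in chain:
            v = step(v)
        out.append(v)
    return out


def containment(source: List[str], target: List[str]) -> float:
    """Fraction of source values that appear in the target column."""
    if not source:
        return 0.0
    tgt = set(target)
    return sum(1 for v in source if v in tgt) / len(source)


def uniqueness(values: List[str]) -> float:
    """Fraction of distinct values; 1.0 means the column still behaves like a key."""
    if not values:
        return 0.0
    return len(set(values)) / len(values)


def reward(source: List[str], target: List[str], chain: List[Transform]) -> float:
    """Assumed reward: similarity (containment) scaled by key distinctiveness."""
    transformed = apply_chain(source, chain)
    return containment(transformed, target) * uniqueness(transformed)


# Hypothetical transformations for keys such as "ID-001" versus "1".
def strip_prefix(s: str) -> str:
    return s[3:] if s.startswith("ID-") else s


def strip_zeros(s: str) -> str:
    return s.lstrip("0") or "0"


def to_constant(s: str) -> str:
    # Degenerate transformation: every key collapses to the same value.
    return "1"


source_keys = ["ID-001", "ID-002", "ID-003"]
target_keys = ["1", "2", "3"]

good = reward(source_keys, target_keys, [strip_prefix, strip_zeros])
identity = reward(source_keys, target_keys, [])
collapsed = reward(source_keys, target_keys, [to_constant])
```

Under these assumptions, the two-step chain matches every target key while keeping the keys distinct, the empty chain matches nothing, and the collapsing transformation is penalized even though every transformed value appears in the target. That last case is the kind of spurious match a similarity-only reward would not catch.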
Taxonomy
Topics: Data Quality and Management · Advanced Database Systems and Queries · Web Data Mining and Analysis
