QJoin: Transformation-aware Joinable Data Discovery Using Reinforcement Learning

Ning Wang; Sainyam Galhotra

arXiv:2512.02444 · cs.DB · December 3, 2025

Open Access

TL;DR

QJoin is a reinforcement learning framework that improves joinable data discovery by learning transformation strategies and reusing them, leading to higher accuracy and efficiency in heterogeneous data repositories.

Contribution

The paper introduces a reinforcement learning approach, with transfer and reuse mechanisms, for transformation-aware join discovery in large, heterogeneous data repositories.

Findings

01. Achieves a 91.0% F1-score on the AutoJoin Web benchmark.

02. Reduces join task runtime by up to 7.4%.

03. Effectively reuses learned transformations for similar tasks.

Abstract

Discovering which tables in large, heterogeneous repositories can be joined and by what transformations is a central challenge in data integration and data discovery. Traditional join discovery methods are largely designed for equi-joins, which assume that join keys match exactly or nearly so. These techniques, while efficient in clean, well-normalized databases, fail in open or federated settings where identifiers are inconsistently formatted, embedded, or split across multiple columns. Approximate or fuzzy joins alleviate minor string variations but cannot capture systematic transformations. We introduce QJoin, a reinforcement-learning framework that learns and reuses transformation strategies across join tasks. QJoin trains an agent under a uniqueness-aware reward that balances similarity with key distinctiveness, enabling it to explore concise, high-value transformation chains. To…
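To make the idea of a uniqueness-aware reward over transformation chains concrete, here is a minimal sketch. It is not the paper's formulation: the abstract excerpt does not give the reward, so the weighted sum, the `alpha` parameter, the use of Jaccard similarity, and the primitive transformations (`lowercase`, `strip_non_alnum`, `keep_prefix2`) are all assumptions chosen for illustration.

```python
from typing import Callable, Iterable, List, Sequence

Transform = Callable[[str], str]

# Hypothetical primitive transformations (illustrative, not from the paper).
def lowercase(s: str) -> str:
    return s.lower()

def strip_non_alnum(s: str) -> str:
    return "".join(ch for ch in s if ch.isalnum())

def keep_prefix2(s: str) -> str:
    return s[:2]

def apply_chain(chain: Sequence[Transform], values: Iterable[str]) -> List[str]:
    """Apply each transformation in order to every value in a column."""
    out = []
    for v in values:
        for t in chain:
            v = t(v)
        out.append(v)
    return out

def jaccard(a: Iterable[str], b: Iterable[str]) -> float:
    """Set-based overlap between two columns (assumed similarity measure)."""
    sa, sb = set(a), set(b)
    if not sa and not sb:
        return 0.0
    return len(sa & sb) / len(sa | sb)

def uniqueness(values: Sequence[str]) -> float:
    """Fraction of distinct values; 1.0 means the column still acts as a key."""
    return len(set(values)) / len(values) if values else 0.0

def reward(chain: Sequence[Transform], source: Sequence[str],
           target: Sequence[str], alpha: float = 0.5) -> float:
    """Assumed reward: weighted sum of similarity and key distinctiveness."""
    transformed = apply_chain(chain, source)
    return alpha * jaccard(transformed, target) + (1 - alpha) * uniqueness(transformed)

source = ["ID-001", "ID-002", "ID-003"]
target = ["id001", "id002", "id004"]

good_chain = [lowercase, strip_non_alnum]   # preserves distinct keys
collapsing_chain = [lowercase, keep_prefix2]  # maps every value to "id"

print(reward(good_chain, source, target))
print(reward(collapsing_chain, source, target))
```

In this toy setup the chain that keeps keys distinct scores higher than the one that collapses every value to the same prefix. That is the behavior a uniqueness term is meant to encourage: a transformation that destroys key distinctiveness is penalized regardless of how it affects overlap.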

Taxonomy

Topics: Data Quality and Management · Advanced Database Systems and Queries · Web Data Mining and Analysis