Efficiently Transforming Tables for Joinability
Arash Dargahi Nobari, Davood Rafiei

TL;DR
This paper presents an efficient algorithm for transforming and joining textual data from different sources despite formatting differences, significantly improving speed over existing methods.
Contribution
We develop a novel algorithm that efficiently finds transformations to join mismatched textual data, outperforming state-of-the-art approaches in coverage and speed.
Findings
The algorithm covers all transformations found by the baseline approach.
The method is several orders of magnitude faster than existing methods.
It is effective on both real and synthetic datasets.
Abstract
Data from different sources rarely conform to a single format even if they describe the same set of entities, and this raises concerns when data from multiple sources must be joined or cross-referenced. Such a formatting mismatch is unavoidable when data is gathered from various public and third-party sources. Commercial database systems are not able to perform the join when there exist differences in data representation or formatting, and manual reformatting is both time-consuming and error-prone. We study the problem of efficiently joining textual data under the condition that the join columns are not formatted the same and cannot be equi-joined, but they become joinable under some transformations. The problem is challenging simply because the number of possible transformations explodes with both the length of the input and the number of rows, even if each transformation is formed…
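To make the problem setting concrete, here is a minimal Python sketch of two columns that cannot be equi-joined as written but become joinable once a transformation is applied to one side. It is an illustration only, not the paper's algorithm: the transformation is written by hand rather than discovered, and the names `last_first_to_first_last` and `transform_join` are hypothetical, as are the sample rows.

```python
from typing import Callable, List, Tuple


def last_first_to_first_last(name: str) -> str:
    """Hypothetical hand-written transformation: 'Last, First' -> 'First Last'.

    Raises ValueError when the input has no comma to split on.
    """
    last, first = [part.strip() for part in name.split(",", 1)]
    return f"{first} {last}"


def transform_join(
    source: List[str],
    target: List[str],
    transform: Callable[[str], str],
) -> List[Tuple[str, str]]:
    """Equi-join transform(source value) against the target values."""
    target_values = set(target)
    matches = []
    for value in source:
        try:
            key = transform(value)
        except ValueError:
            # The transformation does not apply to this row; skip it.
            continue
        if key in target_values:
            matches.append((value, key))
    return matches


# Assumed sample data: the same entities, formatted differently.
source_column = ["Rafiei, Davood", "Dargahi Nobari, Arash", "Unknown"]
target_column = ["Arash Dargahi Nobari", "Davood Rafiei"]

# A plain equi-join finds nothing because no string is shared verbatim.
plain_matches = [s for s in source_column if s in set(target_column)]

# After the transformation, two of the three source rows join.
joined = transform_join(source_column, target_column, last_first_to_first_last)
```

The hard part that the abstract points to is finding such a transformation automatically: the space of candidates grows with the length of the values and with the number of rows, which this sketch sidesteps by fixing one transformation in advance.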
Taxonomy
Topics: Advanced Database Systems and Queries · Data Quality and Management · Semantic Web and Ontologies
