Fuzzy Integration of Data Lake Tables

Aamod Khatiwada; Roee Shraga; Renée J. Miller

arXiv:2501.09211 · cs.DB · January 17, 2025

TL;DR

This paper introduces a fuzzy extension to the Full Disjunction operator for data integration, allowing approximate matches to improve dataset unification without significant performance loss.

Contribution

It proposes a novel data-driven method to incorporate fuzzy matching into Full Disjunction, addressing limitations of exact matching in diverse, real-world datasets.

Findings

1. Fuzzy Full Disjunction enhances data integration effectiveness.

2. The approach incurs minimal additional computational overhead.

3. Experimental results confirm improved dataset unification.

Abstract

Data integration is an important step in any data science pipeline where the objective is to unify the information available in different datasets for comprehensive analysis. Full Disjunction, which is an associative extension of the outer join operator, has been shown to be an effective operator for integrating datasets. It fully preserves and combines the available information. Existing Full Disjunction algorithms only consider the equi-join scenario where only tuples having the same value on joining columns are integrated. This, however, does not realistically represent an open data scenario, where datasets come from diverse sources with inconsistent values (e.g., synonyms, abbreviations, etc.) and with limited metadata. So, joining just on equal values severely limits the ability of Full Disjunction to fully combine datasets. Thus, in this work, we propose an extension of Full…
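To make the limitation of equi-joins concrete, the following is a minimal sketch, not the paper's algorithm. It runs a plain two-table full outer join twice on toy data, once with exact equality on the join column and once with a string-similarity test. The helper names (`similarity`, `full_outer_join`), the toy tables, and the 0.8 threshold are all assumptions made for illustration. Character-level similarity from Python's `difflib` only catches case differences and typos; it would not resolve synonyms or abbreviations, and a binary outer join is only the two-table special case of Full Disjunction.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Case-insensitive similarity in [0, 1], using difflib's ratio."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def full_outer_join(left, right, key, match):
    """Full outer join of two lists of dicts on `key`.

    `match(a, b)` decides whether two join values refer to the same entity.
    Unmatched rows from either side are kept, with the other side's
    attributes simply absent.
    """
    result = []
    matched_right = set()
    for l_row in left:
        found = False
        for i, r_row in enumerate(right):
            if match(l_row[key], r_row[key]):
                # Merge both rows; the left table's spelling of the key wins.
                result.append({**r_row, **l_row})
                matched_right.add(i)
                found = True
        if not found:
            result.append(dict(l_row))
    # Keep right-side rows that never matched anything on the left.
    result.extend(dict(r) for i, r in enumerate(right) if i not in matched_right)
    return result


# Toy tables (hypothetical data): the same cities, spelled inconsistently.
countries = [
    {"city": "Boston", "country": "USA"},
    {"city": "Toronto", "country": "Canada"},
]
timezones = [
    {"city": "boston", "timezone": "EST"},
    {"city": "Torronto", "timezone": "EST"},
    {"city": "Berlin", "timezone": "CET"},
]

THRESHOLD = 0.8  # assumed cut-off, chosen only for this example

exact_rows = full_outer_join(countries, timezones, "city", lambda a, b: a == b)
fuzzy_rows = full_outer_join(
    countries, timezones, "city", lambda a, b: similarity(a, b) >= THRESHOLD
)

print(len(exact_rows), "rows with exact matching")
print(len(fuzzy_rows), "rows with fuzzy matching")
```

With exact matching, no join values are equal, so every input row survives separately and nothing is combined. With the similarity test, the Boston and Toronto rows are merged across tables and only Berlin remains unmatched.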

Taxonomy

Topics: Time Series Analysis and Forecasting · Data Mining Algorithms and Applications · Data Management and Algorithms