Minimally-Supervised Attribute Fusion for Data Lakes

Karamjit Singh, Garima Gupta, Gautam Shroff, and Puneet Agarwal

arXiv:1701.01094 · cs.DB · January 5, 2017

Open Access

TL;DR

This paper introduces a minimally-supervised ensemble model that combines Bayesian networks with unsupervised textual matching to automate attribute fusion in data lakes, improving record linkage accuracy and attaching a confidence score to each match.

Contribution

The paper proposes an ensemble approach to attribute fusion that combines a minimally-supervised model with unsupervised textual matching to improve record linkage in data lakes.

Findings

1. Effective on large real-life datasets from market research
2. Outperforms standard record matching algorithms
3. Provides confidence scores for matches

Abstract

Aggregate analysis, such as comparing country-wise sales versus global market share across product categories, is often complicated by the unavailability of common join attributes, e.g., category, across diverse datasets from different geographies or retail chains, even after disparate data is technically ingested into a common data lake. Sometimes this is a missing data issue, while in other cases it may be inherent, e.g., the records in different geographical databases may actually describe different product 'SKUs', or follow different norms for categorization. Record linkage techniques can be used to automatically map products in different data sources to a common set of global attributes, thereby enabling federated aggregation joins to be performed. Traditional record-linkage techniques are typically unsupervised, relying on textual similarity features across attributes to estimate…
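
To make the idea of unsupervised, textual-similarity-based linkage concrete, the following is a minimal sketch of mapping a local product description to one of a set of global category labels. It is not the paper's ensemble model: it uses a plain character-level similarity from Python's standard `difflib`, and the function name `best_global_category`, the category list, and the product string are all hypothetical illustrations.

```python
from difflib import SequenceMatcher


def best_global_category(description, global_categories):
    """Return (category, score) for the most textually similar category.

    Hypothetical helper: the score is a character-level similarity ratio in
    [0, 1], used here only as a crude stand-in for a match confidence.
    """
    text = description.lower()
    scored = [
        (SequenceMatcher(None, text, category.lower()).ratio(), category)
        for category in global_categories
    ]
    # max() picks the highest ratio; ties fall back to category name order.
    score, category = max(scored)
    return category, score


# Made-up example data, not taken from the paper.
categories = ["Shampoo", "Biscuits", "Soft Drinks"]
category, score = best_global_category("Dove Shampoo 200ml", categories)
print(category, round(score, 2))
```

A purely textual score like this has no notion of which attribute values tend to co-occur, which is one reason a baseline of this kind may mislabel records whose descriptions do not literally contain the category name.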

Taxonomy

Topics: Data Quality and Management · Data Mining Algorithms and Applications · Advanced Database Systems and Queries