Minimally-Supervised Attribute Fusion for Data Lakes
Karamjit Singh, Garima Gupta, Gautam Shroff, and Puneet Agarwal

TL;DR
This paper introduces a minimally-supervised ensemble model that combines Bayesian networks with unsupervised textual matching to automate attribute fusion in data lakes, improving record linkage accuracy with confidence scores.
Contribution
It proposes a novel ensemble approach for attribute fusion that integrates minimal supervision with unsupervised textual matching, enhancing data lake record linkage.
Findings
Effective on large real-life datasets from market research
Outperforms standard record matching algorithms
Provides confidence scores for matches
Abstract
Aggregate analysis, such as comparing country-wise sales versus global market share across product categories, is often complicated by the unavailability of common join attributes, e.g., category, across diverse datasets from different geographies or retail chains, even after disparate data is technically ingested into a common data lake. Sometimes this is a missing data issue, while in other cases it may be inherent, e.g., the records in different geographical databases may actually describe different product 'SKUs', or follow different norms for categorization. Record linkage techniques can be used to automatically map products in different data sources to a common set of global attributes, thereby enabling federated aggregation joins to be performed. Traditional record-linkage techniques are typically unsupervised, relying on textual similarity features across attributes to estimate…
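To make the idea of unsupervised textual matching concrete, the following is a minimal sketch of linking a local product description to one of a set of global categories by token overlap (Jaccard similarity), with the similarity doubling as a rough confidence score. It is not the paper's ensemble model and does not include the Bayesian network component; the function names, the category list, and the choice of Jaccard similarity are all illustrative assumptions.

```python
def tokens(text: str) -> set[str]:
    """Lowercase a string and split it into a set of whitespace-separated tokens."""
    return set(text.lower().split())


def jaccard(a: str, b: str) -> float:
    """Jaccard similarity of the token sets of two strings, in [0, 1]."""
    ta, tb = tokens(a), tokens(b)
    if not ta and not tb:
        return 0.0  # avoid division by zero when both strings are empty
    return len(ta & tb) / len(ta | tb)


def link_to_category(description: str, categories: list[str]) -> tuple[str, float]:
    """Return the best-matching category and its similarity score.

    The score is only a crude stand-in for a match confidence (assumption).
    Ties are resolved in favour of the earlier category in the list.
    """
    scored = [(cat, jaccard(description, cat)) for cat in categories]
    return max(scored, key=lambda pair: pair[1])


# Hypothetical global attribute values; not taken from the paper.
GLOBAL_CATEGORIES = ["carbonated soft drinks", "bottled water", "fruit juice"]

best, score = link_to_category("Sparkling Water 500ml bottle", GLOBAL_CATEGORIES)
print(best, round(score, 2))
```

A purely textual score like this fails when the description shares no tokens with the right category, which is the kind of gap that adding a small amount of supervision is meant to close.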
Taxonomy
Topics: Data Quality and Management · Data Mining Algorithms and Applications · Advanced Database Systems and Queries
