Ground Truth Inference for Weakly Supervised Entity Matching
Renzhi Wu, Alexander Bendeck, Xu Chu, Yeye He

TL;DR
This paper introduces a weak supervision labeling model tailored to entity matching. It infers match/non-match labels from user-written labeling functions and enforces transitivity, reducing manual labeling effort while maintaining high accuracy.
Contribution
It proposes a simple yet effective labeling model for weak supervision in entity matching, incorporating transitivity constraints to improve label quality and performance.
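To make the transitivity constraint concrete: if record a matches b and b matches c, then a must also match c. The snippet below is a minimal illustrative sketch, not the paper's labeling model. It only detects triples of records whose pairwise labels break that rule. The record IDs, the `matches` set, and the function names are hypothetical.

```python
from itertools import combinations


def transitivity_violations(records, is_match):
    """Return triples where exactly two of the three pairs are labeled as matches.

    Such a triple is inconsistent: two matches imply the third pair must match too.
    """
    violations = []
    for a, b, c in combinations(records, 3):
        n_matches = is_match(a, b) + is_match(b, c) + is_match(a, c)
        if n_matches == 2:
            violations.append((a, b, c))
    return violations


# Hypothetical pairwise labels: r1~r2 and r2~r3 are matches, r1~r3 is not.
matches = {frozenset(("r1", "r2")), frozenset(("r2", "r3"))}


def is_match(x, y):
    return frozenset((x, y)) in matches


violations = transitivity_violations(["r1", "r2", "r3", "r4"], is_match)
print(violations)
```

Enumerating all triples is cubic in the number of records, so a sketch like this is only practical on small candidate sets; it is meant to show what the constraint says, not how the paper enforces it.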
Findings
Outperforms existing weak supervision methods across ten datasets.
Achieves 9% higher F1 score on average compared to previous methods.
Enables training deep learning EM models with substantially less labeled data.
Abstract
Entity matching (EM) refers to the problem of identifying pairs of data records in one or more relational tables that refer to the same entity in the real world. Supervised machine learning (ML) models currently achieve state-of-the-art matching performance; however, they require many labeled examples, which are often expensive or infeasible to obtain. This has inspired us to approach data labeling for EM using weak supervision. In particular, we use the labeling function abstraction popularized by Snorkel, where each labeling function (LF) is a user-provided program that can generate many noisy match/non-match labels quickly and cheaply. Given a set of user-written LFs, the quality of data labeling depends on a labeling model to accurately infer the ground-truth labels. In this work, we first propose a simple but powerful labeling model for general weak supervision tasks. Then, we…
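As an illustration of the labeling function abstraction described in the abstract, the following sketch defines three toy LFs over record pairs and combines their votes by simple majority. Majority vote is a common baseline and is used here only as a stand-in; it is not the labeling model proposed in the paper. The record fields (`name`, `zip`), the LF names, and the 0.5 threshold are all assumptions made for the example.

```python
MATCH, NON_MATCH, ABSTAIN = 1, 0, -1


def lf_same_name(a, b):
    # Vote MATCH when names are identical ignoring case; otherwise abstain.
    return MATCH if a["name"].lower() == b["name"].lower() else ABSTAIN


def lf_different_zip(a, b):
    # Vote NON_MATCH when zip codes differ; otherwise abstain.
    return NON_MATCH if a["zip"] != b["zip"] else ABSTAIN


def lf_name_overlap(a, b):
    # Vote based on Jaccard overlap of name tokens (threshold is an assumption).
    ta, tb = set(a["name"].lower().split()), set(b["name"].lower().split())
    if not (ta | tb):
        return ABSTAIN
    jaccard = len(ta & tb) / len(ta | tb)
    return MATCH if jaccard >= 0.5 else NON_MATCH


def majority_vote(a, b, lfs):
    """Baseline label aggregation: majority of non-abstaining LF votes."""
    votes = [v for v in (lf(a, b) for lf in lfs) if v != ABSTAIN]
    if not votes:
        return ABSTAIN
    n_match = sum(votes)  # MATCH is 1 and NON_MATCH is 0
    if 2 * n_match == len(votes):
        return ABSTAIN  # tie between MATCH and NON_MATCH
    return MATCH if 2 * n_match > len(votes) else NON_MATCH


lfs = [lf_same_name, lf_different_zip, lf_name_overlap]
r1 = {"name": "Acme Corp", "zip": "10001"}
r2 = {"name": "ACME Corp", "zip": "10001"}
r3 = {"name": "Apex Ltd", "zip": "94105"}

label_12 = majority_vote(r1, r2, lfs)
label_13 = majority_vote(r1, r3, lfs)
```

Each LF is cheap to write and may be wrong or abstain on any given pair, which is why the quality of the final labels depends on how the votes are aggregated.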
Taxonomy
Topics: Data Quality and Management · Topic Modeling · Data Stream Mining Techniques
