Gradual Machine Learning for Entity Resolution

Boyi Hou; Qun Chen; Yanyan Wang; Youcef Nafa; Zhanhuai Li

arXiv:1810.12125·cs.DB·June 17, 2019

Open Access

TL;DR

This paper introduces gradual machine learning, a paradigm for entity resolution that reduces the need for manual labeling: the machine first labels easy instances automatically, then iteratively labels more challenging ones, achieving results competitive with supervised methods.

Contribution

It proposes a new learning approach that automatically labels data in stages, minimizing manual effort and outperforming unsupervised methods while competing with supervised techniques.

Findings

01 Outperforms unsupervised methods in entity resolution

02 Achieves results comparable to supervised state-of-the-art

03 Reduces manual labeling effort significantly

Abstract

Usually considered as a classification problem, entity resolution (ER) can be very challenging on real data due to the prevalence of dirty values. The state-of-the-art solutions for ER were built on a variety of learning models (most notably deep neural networks), which require lots of accurately labeled training data. Unfortunately, high-quality labeled data usually require expensive manual work, and are therefore not readily available in many real scenarios. In this paper, we propose a novel learning paradigm for ER, called gradual machine learning, which aims to enable effective machine labeling without the requirement for manual labeling effort. It begins with some easy instances in a task, which can be automatically labeled by the machine with high accuracy, and then gradually labels more challenging instances by iterative factor graph inference. In gradual machine learning, the…
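
The easy-to-hard labeling order described in the abstract can be illustrated with a minimal sketch. It is not the authors' method: the paper uses iterative factor graph inference, while this sketch swaps in a much simpler stand-in (copy the label of the nearest already-labeled pair, and always commit the most confident unlabeled pair first). The function name `gradual_label`, the single similarity score per candidate pair, and the thresholds `easy_low` and `easy_high` are all assumptions made for illustration.

```python
from typing import Dict, List, Optional, Tuple


def gradual_label(
    similarities: List[float],
    easy_low: float = 0.1,   # hypothetical threshold: clearly non-matching
    easy_high: float = 0.9,  # hypothetical threshold: clearly matching
) -> Dict[int, int]:
    """Label candidate pairs from easy to hard (1 = match, 0 = non-match)."""
    labels: Dict[int, int] = {}

    # Step 1: label the easy instances automatically from extreme scores.
    for i, s in enumerate(similarities):
        if s <= easy_low:
            labels[i] = 0
        elif s >= easy_high:
            labels[i] = 1
    if not labels:
        raise ValueError("no easy instances; adjust the thresholds")

    # Step 2: gradually label the rest. Each round commits the single
    # unlabeled pair with the strongest evidence, so later decisions can
    # build on earlier ones. Stand-in for factor graph inference: evidence
    # is the distance to the nearest labeled pair in similarity space.
    unlabeled = {i for i in range(len(similarities)) if i not in labels}
    while unlabeled:
        best: Optional[Tuple[float, int, int]] = None
        for i in sorted(unlabeled):
            nearest = min(
                labels, key=lambda k: (abs(similarities[k] - similarities[i]), k)
            )
            dist = abs(similarities[nearest] - similarities[i])
            if best is None or dist < best[0]:
                best = (dist, i, labels[nearest])
        _, chosen, label = best
        labels[chosen] = label
        unlabeled.remove(chosen)

    return labels


if __name__ == "__main__":
    scores = [0.95, 0.05, 0.8, 0.2, 0.6, 0.45]
    print(gradual_label(scores))
```

In this toy run the pair with score 0.45 is decided last, after its neighbors have been labeled, which is the point of the gradual ordering: hard instances are resolved using evidence from easier ones rather than from the initial seed labels alone.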

Taxonomy

Topics: Data Quality and Management · Anomaly Detection Techniques and Applications · Privacy-Preserving Technologies in Data