Semi-supervised clustering for de-duplication
Shrinu Kushagra, Shai Ben-David, Ihab Ilyas

TL;DR
This paper introduces a semi-supervised clustering framework for data de-duplication, proving NP-hardness of the problem even with oracle assistance, and proposes an algorithm for a restricted version with success guarantees.
Contribution
It formalizes promise correlation clustering, proves its NP-hardness with limited oracle queries, and offers a semi-supervised algorithm for a restricted clustering class.
Findings
NP-hardness of promise correlation clustering with oracle access
Limited oracle queries do not simplify the problem
Proposed semi-supervised algorithm achieves success guarantees for a restricted class
Abstract
Data de-duplication is the task of detecting multiple records that correspond to the same real-world entity in a database. In this work, we view de-duplication as a clustering problem where the goal is to put records corresponding to the same physical entity in the same cluster and records corresponding to different physical entities into different clusters. We introduce a framework which we call promise correlation clustering. Given a complete graph whose edges each carry one of two labels ("same entity" or "different entity"), the goal is to find a clustering that minimizes the number of "different entity" edges within a cluster plus the number of "same entity" edges across different clusters (the correlation loss). The optimal clustering can also be viewed as a complete graph in which edges between points in the same cluster are labelled "same entity" and all other edges are labelled "different entity". Under the promise that the edge difference…
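The correlation loss described in the abstract can be illustrated with a minimal Python sketch. The function name `correlation_loss`, the set-of-pairs encoding of edge labels, and the toy records are assumptions made for illustration only; they are not taken from the paper.

```python
from itertools import combinations


def correlation_loss(similar, assignment):
    """Count edges whose label disagrees with a clustering.

    similar: set of frozenset({u, v}) pairs labelled "same entity";
             every other pair of records is treated as "different entity".
    assignment: dict mapping each record to a cluster id.
    """
    loss = 0
    # The graph is complete, so every unordered pair of records is an edge.
    for u, v in combinations(sorted(assignment), 2):
        same_cluster = assignment[u] == assignment[v]
        labelled_same = frozenset((u, v)) in similar
        # A "different entity" edge inside a cluster, or a "same entity"
        # edge across clusters, contributes one unit of loss.
        if same_cluster != labelled_same:
            loss += 1
    return loss


# Hypothetical toy input: four records whose "same entity" edges form a chain.
similar = {frozenset((0, 1)), frozenset((1, 2)), frozenset((2, 3))}
two_clusters = {0: "a", 1: "a", 2: "b", 3: "b"}
one_cluster = {0: "a", 1: "a", 2: "a", 3: "a"}

print(correlation_loss(similar, two_clusters))
print(correlation_loss(similar, one_cluster))
```

In this toy input, splitting the records into two clusters cuts only the "same entity" edge between records 1 and 2, while merging everything into one cluster places three "different entity" edges inside it. The sketch only evaluates a given clustering; finding the minimizer is the hard part the paper addresses.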
Taxonomy
Topics: Data Quality and Management · Data-Driven Disease Surveillance · Privacy-Preserving Technologies in Data
