Semi-supervised clustering for de-duplication

Shrinu Kushagra; Shai Ben-David; Ihab Ilyas

arXiv:1810.04361 · cs.LG · May 27, 2020

Open Access

TL;DR

This paper introduces a semi-supervised clustering framework for data de-duplication, proving NP-hardness of the problem even with oracle assistance, and proposes an algorithm for a restricted version with success guarantees.

Contribution

It formalizes promise correlation clustering, proves its NP-hardness with limited oracle queries, and offers a semi-supervised algorithm for a restricted clustering class.

Findings

01 NP-hardness of promise correlation clustering with oracle access

02 Limited oracle queries do not simplify the problem

03 Proposed semi-supervised algorithm achieves success guarantees for a restricted class

Abstract

Data de-duplication is the task of detecting multiple records that correspond to the same real-world entity in a database. In this work, we view de-duplication as a clustering problem where the goal is to put records corresponding to the same physical entity in the same cluster and records corresponding to different physical entities in different clusters. We introduce a framework which we call promise correlation clustering. Given a complete graph $G$ with the edges labelled $0$ and $1$, the goal is to find a clustering that minimizes the number of $0$ edges within a cluster plus the number of $1$ edges across different clusters (the correlation loss). The optimal clustering can also be viewed as a complete graph $G^{*}$ with edges corresponding to points in the same cluster being labelled $1$ and other edges being labelled $0$. Under the promise that the edge difference…
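The correlation loss described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' code: the function names `correlation_loss` and `clustering_graph`, the dictionary-based edge representation, and the four-record example are all assumptions made for illustration. The sketch follows the convention implied by the loss, namely that label $1$ marks a pair judged to be the same entity and label $0$ a pair judged to be different.

```python
from itertools import combinations


def correlation_loss(n, edge_label, cluster_of):
    """Count 0-edges inside clusters plus 1-edges across clusters.

    n          -- number of records (vertices 0..n-1)
    edge_label -- dict mapping each pair (u, v) with u < v to 0 or 1
    cluster_of -- list where cluster_of[u] is the cluster id of record u
    """
    loss = 0
    for u, v in combinations(range(n), 2):
        same_cluster = cluster_of[u] == cluster_of[v]
        label = edge_label[(u, v)]
        if same_cluster and label == 0:
            loss += 1  # pair labelled "different" but placed together
        elif not same_cluster and label == 1:
            loss += 1  # pair labelled "same" but placed apart
    return loss


def clustering_graph(n, cluster_of):
    """View a clustering as a labelled complete graph (1 = same cluster)."""
    return {
        (u, v): int(cluster_of[u] == cluster_of[v])
        for u, v in combinations(range(n), 2)
    }


# Hypothetical example: four records with one noisy label on the pair (1, 2).
labels = {
    (0, 1): 1, (0, 2): 0, (0, 3): 0,
    (1, 2): 1, (1, 3): 0, (2, 3): 1,
}
candidate = [0, 0, 1, 1]  # records 0, 1 in one cluster; 2, 3 in another
loss = correlation_loss(4, labels, candidate)
```

In this example only the edge (1, 2) disagrees with the candidate clustering, so the loss counts a single violation. A clustering always has zero loss against its own graph, which is why the optimal clustering can be identified with a labelled graph $G^{*}$.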

Taxonomy

Topics: Data Quality and Management · Data-Driven Disease Surveillance · Privacy-Preserving Technologies in Data