Graph-based hierarchical record clustering for unsupervised entity resolution
Islam Akef Ebeid, John R. Talburt, Md Abdus Salam Siddique

TL;DR
This paper introduces a graph-based hierarchical clustering method for unsupervised entity resolution that runs faster and achieves higher precision and F1 scores than the original implementation of the Data Washing Machine (DWM), a probabilistic entity resolution framework.
Contribution
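The paper proposes a two-step graph-based clustering approach (GDWM) built on the Data Washing Machine (DWM) framework. Step one reuses the DWM's graph-based transitive closure to find connected components ("soft clusters") among the matched record pairs; step two hierarchically splits each soft cluster into more precise entity clusters with an adapted modularity optimization method.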
Findings
Significant speed-up over the original DWM implementation
Increased precision and F1 scores
Efficacy demonstrated in experiments on multiple synthetic datasets
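As a rough illustration of why splitting an oversized cluster raises precision and F1, the sketch below scores a clustering by the record pairs it links. This summary does not state the paper's evaluation protocol, so pairwise precision, recall, and F1 (a common convention in entity resolution) are an assumption here, and the names `linked_pairs` and `pairwise_scores` are hypothetical.

```python
from itertools import combinations


def linked_pairs(clusters):
    """Return every unordered record pair that shares a cluster."""
    return {frozenset(p) for c in clusters for p in combinations(c, 2)}


def pairwise_scores(predicted, truth):
    """Pairwise precision, recall, and F1 of a predicted clustering.

    Assumption: scoring is done over linked pairs; the paper may use a
    different protocol.
    """
    pred, true = linked_pairs(predicted), linked_pairs(truth)
    tp = len(pred & true)  # pairs linked in both clusterings
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(true) if true else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Two true entities with three records each (toy data, not from the paper).
truth = [["a1", "a2", "a3"], ["b1", "b2", "b3"]]
# One oversized cluster links all 15 pairs, but only 6 are correct.
merged = [["a1", "a2", "a3", "b1", "b2", "b3"]]

print(pairwise_scores(merged, truth))  # precision 0.4, recall 1.0
print(pairwise_scores(truth, truth))   # (1.0, 1.0, 1.0)
```

In this toy case the single merged cluster keeps recall at 1.0 but drops precision to 0.4, which is the kind of error a second, finer clustering step is meant to remove.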
Abstract
Here we study the problem of matched record clustering in unsupervised entity resolution. We build upon a state-of-the-art probabilistic framework named the Data Washing Machine (DWM). We introduce a graph-based hierarchical 2-step record clustering method (GDWM) that first identifies large, connected components or, as we call them, soft clusters in the matched record pairs using a graph-based transitive closure algorithm utilized in the DWM. That is followed by breaking down the discovered soft clusters into more precise entity clusters in a hierarchical manner using an adapted graph-based modularity optimization method. Our approach provides several advantages over the original implementation of the DWM, mainly a significant speed-up, increased precision, and overall increased F1 scores. We demonstrate the efficacy of our approach using experiments on multiple synthetic datasets. Our…
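The two steps described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: step one uses a union-find pass to compute the transitive closure of the matched pairs, and step two uses a plain greedy agglomerative modularity merge as a stand-in for the paper's adapted modularity optimization method, whose details the abstract does not give. All function names and the toy record IDs are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def soft_clusters(pairs):
    """Step 1: connected components of the matched pairs (transitive closure)."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in pairs:
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[rb] = ra
    groups = defaultdict(set)
    for x in list(parent):
        groups[find(x)].add(x)
    return sorted((sorted(g) for g in groups.values()), key=lambda g: g[0])


def modularity(edges, communities):
    """Newman modularity Q of a partition of an unweighted graph."""
    m = len(edges)
    label = {n: i for i, c in enumerate(communities) for n in c}
    intra, degree = defaultdict(int), defaultdict(int)
    for a, b in edges:
        degree[label[a]] += 1
        degree[label[b]] += 1
        if label[a] == label[b]:
            intra[label[a]] += 1
    return sum(intra[i] / m - (degree[i] / (2 * m)) ** 2
               for i in range(len(communities)))


def split_by_modularity(nodes, edges):
    """Step 2 (assumed stand-in): merge communities greedily while Q improves."""
    communities = [{n} for n in nodes]
    best = modularity(edges, communities)
    while len(communities) > 1:
        candidates = []
        for i, j in combinations(range(len(communities)), 2):
            merged = [c for k, c in enumerate(communities) if k not in (i, j)]
            merged.append(communities[i] | communities[j])
            candidates.append((modularity(edges, merged), merged))
        q, merged = max(candidates, key=lambda t: t[0])
        if q <= best:  # no merge raises modularity: stop
            break
        best, communities = q, merged
    return sorted(sorted(c) for c in communities)


def resolve(pairs):
    """Run both steps: soft clusters first, then split each one."""
    entity_clusters = []
    for cluster in soft_clusters(pairs):
        members = set(cluster)
        edges = [(a, b) for a, b in pairs if a in members]
        entity_clusters.extend(split_by_modularity(cluster, edges))
    return entity_clusters


# Toy matched pairs: two tight groups joined by one spurious link (a3, b1).
pairs = [("a1", "a2"), ("a2", "a3"), ("a1", "a3"), ("a3", "b1"),
         ("b1", "b2"), ("b2", "b3"), ("b1", "b3"), ("c1", "c2")]
print(soft_clusters(pairs))
print(resolve(pairs))
```

On this toy input, step one returns one six-record soft cluster plus the `c1`/`c2` pair, and step two splits the six-record cluster into its `a` and `b` groups because merging across the single bridging link would lower modularity. The exhaustive pairwise search is quadratic in the number of communities per merge and is only meant to keep the sketch short.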
Taxonomy
Topics: Data Quality and Management · Privacy-Preserving Technologies in Data · Data-Driven Disease Surveillance
