Scalable Entity Resolution Using Probabilistic Signatures on Parallel Databases
Yuhang Zhang, Kee Siong Ng, Michael Walker, Pauline Chou, Tania Churchill, Peter Christen

TL;DR
This paper introduces a scalable entity resolution algorithm, based on probabilistic signatures and designed for parallel databases. It achieves state-of-the-art accuracy and efficiency on benchmark datasets and is suited to large industrial datasets.
Contribution
It presents a novel probabilistic signature technique for entity resolution that is scalable and easily implementable on modern parallel database systems.
Findings
Achieves state-of-the-art accuracy on benchmark datasets
Demonstrates scalability and efficiency in parallel database environments
Easily deployable in large industrial applications
Abstract
Accurate and efficient entity resolution is an open challenge of particular relevance to intelligence organisations that collect large datasets from disparate sources with differing levels of quality and standard. Starting from a first-principles formulation of entity resolution, this paper presents a novel Entity Resolution algorithm that introduces a data-driven blocking and record-linkage technique based on the probabilistic identification of entity signatures in data. The scalability and accuracy of the proposed algorithm are evaluated using benchmark datasets and shown to achieve state-of-the-art results. The proposed algorithm can be implemented simply on modern parallel databases, which allows it to be deployed with relative ease in large industrial applications.
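To make the idea of signature-based blocking and linkage concrete, here is a minimal in-memory sketch. It is not the paper's algorithm: the `signatures` generator (unordered pairs of name tokens), the `resolve` function, and the `max_records_per_signature` cutoff are all hypothetical. The cutoff is a crude stand-in for the probabilistic test of whether a signature identifies a single entity, and the paper's parallel-database implementation is not reproduced here.

```python
from collections import defaultdict


def signatures(record):
    # Hypothetical candidate signatures: every unordered pair of
    # lowercase name tokens, so token order does not matter.
    tokens = sorted(set(record["name"].lower().split()))
    return {(a, b) for i, a in enumerate(tokens) for b in tokens[i + 1:]}


def resolve(records, max_records_per_signature=3):
    # Blocking step: group record ids by each signature they contain.
    blocks = defaultdict(set)
    for rid, rec in records.items():
        for sig in signatures(rec):
            blocks[sig].add(rid)

    # Union-find over record ids, used to merge linked records.
    parent = {rid: rid for rid in records}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    # Linkage step (assumption): a signature shared by only a few
    # records is treated as identifying one entity; frequent
    # signatures are ignored as uninformative.
    for rids in blocks.values():
        if 2 <= len(rids) <= max_records_per_signature:
            first, *rest = sorted(rids)
            for other in rest:
                parent[find(other)] = find(first)

    # Collect connected components as the resolved entities.
    clusters = defaultdict(set)
    for rid in records:
        clusters[find(rid)].add(rid)
    return sorted(sorted(c) for c in clusters.values())


records = {
    1: {"name": "Kee Siong Ng"},
    2: {"name": "Ng Kee Siong"},
    3: {"name": "Peter Christen"},
    4: {"name": "Christen Peter"},
    5: {"name": "Michael Walker"},
}
print(resolve(records))  # [[1, 2], [3, 4], [5]]
```

In a database setting, the grouping step would correspond roughly to a group-by on the signature column and the linkage step to a self-join, which is one reason this style of method maps naturally onto parallel databases.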
Taxonomy
Topics: Data Quality and Management · Privacy-Preserving Technologies in Data · Web Data Mining and Analysis
