Large Scale Record Linkage in the Presence of Missing Data
Thilina Ranbaduge, Peter Christen, Rainer Schnell

TL;DR
This paper introduces a novel large-scale record linkage method that effectively handles missing data, errors, and variations by using attribute and relational signatures, improving accuracy and scalability in real-world databases.
Contribution
It presents a new technique combining attribute and relational signatures for accurate, scalable record linkage despite missing or erroneous QID values.
Findings
Achieves high linkage quality with substantial missing data.
Demonstrates scalability on large real-world databases.
Outperforms traditional methods in error-prone scenarios.
Abstract
Record linkage is aimed at the accurate and efficient identification of records that represent the same entity within or across disparate databases. It is a fundamental task in data integration and increasingly required for accurate decision making in application domains ranging from health analytics to national security. Traditional record linkage techniques calculate string similarities between quasi-identifying (QID) values, such as the names and addresses of people. Errors, variations, and missing QID values can however lead to low linkage quality because the similarities between records cannot be calculated accurately. To overcome this challenge, we propose a novel technique that can accurately link records even when QID values contain errors or variations, or are missing. We first generate attribute signatures (concatenated QID values) using an Apriori based selection of suitable…
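The idea of an attribute signature as a concatenation of QID values can be illustrated with a minimal sketch. This is not the authors' implementation: the attribute combinations in `COMBOS` are fixed by hand as an assumption (the paper selects them with an Apriori based approach that the truncated abstract does not describe), the names `signatures` and `candidate_pairs` and the sample records are hypothetical, and relational signatures are not modelled. The sketch only shows why using several combinations can still link records when one QID value is missing.

```python
from collections import defaultdict

# Hypothetical attribute combinations, chosen by hand for illustration.
# The paper selects suitable combinations automatically; that step is not shown here.
COMBOS = [
    ("first_name", "last_name", "birth_year"),
    ("last_name", "postcode"),
    ("first_name", "postcode", "birth_year"),
]


def signatures(record, combos=COMBOS):
    """Return the set of attribute signatures a record can produce.

    A signature is the concatenation of the normalised values of one
    attribute combination. A combination is skipped when any of its
    values is missing, so a record with gaps still yields signatures
    from the combinations that avoid the missing attribute.
    """
    sigs = set()
    for combo in combos:
        values = [record.get(attr) for attr in combo]
        if any(v is None or str(v).strip() == "" for v in values):
            continue  # missing value: this combination cannot be used
        normalised = "|".join(str(v).strip().lower() for v in values)
        sigs.add("+".join(combo) + ":" + normalised)
    return sigs


def candidate_pairs(db_a, db_b, combos=COMBOS):
    """Pair records from two databases that share at least one signature."""
    index = defaultdict(set)
    for rid_a, rec in db_a.items():
        for sig in signatures(rec, combos):
            index[sig].add(rid_a)

    pairs = set()
    for rid_b, rec in db_b.items():
        for sig in signatures(rec, combos):
            for rid_a in index.get(sig, ()):
                pairs.add((rid_a, rid_b))
    return pairs


# Made-up sample records; b1 lacks a birth year, b2 lacks a postcode.
db_a = {
    "a1": {"first_name": "Anna", "last_name": "Smith", "birth_year": 1980, "postcode": "2600"},
    "a2": {"first_name": "Peter", "last_name": "Jones", "birth_year": 1975, "postcode": "2601"},
}
db_b = {
    "b1": {"first_name": "anna", "last_name": "Smith", "birth_year": None, "postcode": "2600"},
    "b2": {"first_name": "Peta", "last_name": "Jones", "birth_year": 1975, "postcode": None},
}

print(candidate_pairs(db_a, db_b))
```

In this sketch, record b1 is still paired with a1 through the last name and postcode combination even though its birth year is missing. Record b2 is not paired with a2, because exact signature matching does not tolerate the spelling variation "Peta" versus "Peter"; handling such errors requires more than this simplified example covers.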
Taxonomy
Topics: Data Quality and Management · Privacy-Preserving Technologies in Data
