Large Scale Record Linkage in the Presence of Missing Data

Thilina Ranbaduge; Peter Christen; Rainer Schnell

arXiv:2104.09677 · cs.DB · April 21, 2021

Open Access

TL;DR

This paper introduces a large-scale record linkage method that uses attribute and relational signatures to link records even when quasi-identifying (QID) values are missing or contain errors or variations, improving linkage accuracy and scalability on real-world databases.

Contribution

It presents a new technique combining attribute and relational signatures for accurate, scalable record linkage despite missing or erroneous QID values.

Findings

01. Achieves high linkage quality with substantial missing data.

02. Demonstrates scalability on large real-world databases.

03. Outperforms traditional methods in error-prone scenarios.

Abstract

Record linkage is aimed at the accurate and efficient identification of records that represent the same entity within or across disparate databases. It is a fundamental task in data integration and increasingly required for accurate decision making in application domains ranging from health analytics to national security. Traditional record linkage techniques calculate string similarities between quasi-identifying (QID) values, such as the names and addresses of people. Errors, variations, and missing QID values can however lead to low linkage quality because the similarities between records cannot be calculated accurately. To overcome this challenge, we propose a novel technique that can accurately link records even when QID values contain errors or variations, or are missing. We first generate attribute signatures (concatenated QID values) using an Apriori based selection of suitable…
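
The idea of an attribute signature as concatenated QID values can be illustrated with a minimal sketch. This is not the authors' algorithm: the records, the attribute names, and the list of attribute combinations in `combos` are all hypothetical, and the combinations are fixed by hand here rather than chosen by the Apriori-based selection the abstract mentions. The sketch only shows why building several signatures per record can tolerate a missing value: a signature is skipped when any of its attributes is missing, yet two records can still be paired through another signature.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical records; None marks a missing QID value.
records = {
    "a1": {"first": "peter", "last": "smith", "city": "sydney", "year": "1980"},
    "b1": {"first": "peter", "last": "smith", "city": None, "year": "1980"},
    "b2": {"first": "mary", "last": "jones", "city": "perth", "year": "1975"},
}

# Hypothetical attribute combinations, fixed by hand for illustration.
# The paper selects suitable combinations with an Apriori-based step
# that is not reproduced here.
combos = [("first", "last", "year"), ("last", "city", "year")]


def attribute_signatures(record, combos):
    """Return the signatures of one record, skipping incomplete combinations."""
    sigs = set()
    for combo in combos:
        values = [record.get(attr) for attr in combo]
        if all(values):  # skip if any value is missing or empty
            sigs.add((combo, "|".join(values)))
    return sigs


def candidate_pairs(records, combos):
    """Pair up records that share at least one attribute signature."""
    index = defaultdict(set)
    for rec_id, rec in records.items():
        for sig in attribute_signatures(rec, combos):
            index[sig].add(rec_id)
    pairs = set()
    for ids in index.values():
        pairs.update(combinations(sorted(ids), 2))
    return pairs


print(candidate_pairs(records, combos))
```

In this toy data, record `b1` has no `city` value, so its second signature is skipped, but it can still be paired with `a1` through the `("first", "last", "year")` signature. Exact concatenation does not handle typographical errors on its own; the sketch covers missing values only.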

Taxonomy

Topics: Data Quality and Management · Privacy-Preserving Technologies in Data