Multifile Partitioning for Record Linkage and Duplicate Detection
Serge Aleshin-Guendel, Mauricio Sadinle

TL;DR
This paper introduces a Bayesian method for multifile record linkage and duplicate detection. It targets merging scenarios in which datafiles cover overlapping sets of entities and may themselves contain duplicates, and it relies on a novel partition representation and flexible loss functions.
Contribution
It presents a Bayesian framework with a structured prior for partitions that can incorporate prior information about how each datafile was collected, and it extends previous models for comparison data to the multifile setting.
Findings
The method performs well in extensive simulations.
Flexible loss functions allow partial resolution of uncertain data.
Code implementation is publicly available for reproducibility.
Abstract
Merging datafiles containing information on overlapping sets of entities is a challenging task in the absence of unique identifiers, and is further complicated when some entities are duplicated in the datafiles. Most approaches to this problem have focused on linking two files assumed to be free of duplicates, or on detecting which records in a single file are duplicates. However, it is common in practice to encounter scenarios that fit somewhere in between or beyond these two settings. We propose a Bayesian approach for the general setting of multifile record linkage and duplicate detection. We use a novel partition representation to propose a structured prior for partitions that can incorporate prior information about the data collection processes of the datafiles in a flexible manner, and extend previous models for comparison data to accommodate the multifile setting. We also…
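To make the idea of a partition over multiple datafiles concrete, the sketch below groups records from several files into clusters, one cluster per latent entity, and then summarizes which files each entity appears in and how many times it is duplicated within each file. This is only an illustration under assumed toy data; the record keys, the entity labels, and the helper names `clusters_from_labels` and `summarize` are hypothetical and are not taken from the paper or its code.

```python
from collections import Counter, defaultdict

# Hypothetical toy data: each record is keyed by (file_id, record_id)
# and mapped to the label of the latent entity it refers to.
labels = {
    ("A", 0): 1,
    ("A", 1): 1,  # duplicate of ("A", 0) within file A
    ("A", 2): 2,
    ("B", 0): 1,  # same entity as ("A", 0), linked across files
    ("B", 1): 3,
    ("C", 0): 2,
    ("C", 1): 3,
}


def clusters_from_labels(labels):
    """Turn an entity labeling into a partition: entity -> set of records."""
    clusters = defaultdict(set)
    for record, entity in labels.items():
        clusters[entity].add(record)
    return dict(clusters)


def summarize(clusters):
    """Summarize a partition by file overlap and within-file duplication.

    Returns:
      overlap: Counter mapping a frozenset of file ids to the number of
               entities that appear in exactly those files.
      per_file_counts: entity -> {file_id: number of records in that file}.
    """
    overlap = Counter()
    per_file_counts = {}
    for entity, records in clusters.items():
        counts = Counter(file_id for file_id, _ in records)
        overlap[frozenset(counts)] += 1
        per_file_counts[entity] = dict(counts)
    return overlap, per_file_counts


clusters = clusters_from_labels(labels)
overlap, per_file_counts = summarize(clusters)
```

In this toy example, a count greater than one in `per_file_counts` marks duplication within a file, while an `overlap` key containing more than one file marks an entity linked across files. Linking two duplicate-free files and deduplicating a single file are the two special cases the abstract mentions.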
Taxonomy
Topics: Data Quality and Management · Privacy-Preserving Technologies in Data · Distributed systems and fault tolerance
