Revisiting the probabilistic method of record linkage
Abel Dasylva, Arthur Goussanou, David Ajavon, Hanan Abousaleh

TL;DR
This paper introduces a new probabilistic record linkage methodology that overcomes previous limitations by modeling record neighborhoods, providing reliable linkage probabilities and error estimates without relying on restrictive assumptions.
Contribution
It proposes a novel finite mixture model with identification properties, enabling accurate, unsupervised record linkage and error evaluation without clerical reviews or independence assumptions.
Findings
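The model handles large populations: matched records can be identified with a probability bounded away from zero, regardless of population size.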
Provides bounds on linkage probabilities.
Enables unsupervised machine learning for record linkage.
Abstract
In theory, the probabilistic linkage method provides two distinct advantages over non-probabilistic methods, namely minimal rates of linkage error and accurate measures of these rates for data users. However, implementations can fall short of these expectations either because the conditional independence assumption is made, or because a model with interactions is used but lacks the identification property. In official statistics, this is currently the main challenge to the automated production and use of linked data. To address this challenge, a new methodology is described for proper linkage problems, where matched records may be identified with a probability that is bounded away from zero, regardless of the population size. It models the number of neighbours of a given record, i.e. the number of resembling records. To be specific, the proposed model is a finite mixture where each…
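For context, the conditional independence assumption mentioned in the abstract can be illustrated with the classical Fellegi-Sunter match weight. The sketch below is not the paper's proposed mixture model; it only shows the baseline the abstract says can fall short. The function names and all parameter values (`m`, `u`, the prior) are hypothetical and chosen for illustration only.

```python
import math

def match_weight(agreements, m, u):
    """Log2 likelihood ratio of a record pair, assuming fields agree
    independently given the match status (conditional independence).

    agreements[k] is True when the pair agrees on field k;
    m[k] = P(agree on k | match), u[k] = P(agree on k | non-match).
    """
    weight = 0.0
    for agrees, m_k, u_k in zip(agreements, m, u):
        if agrees:
            weight += math.log2(m_k / u_k)
        else:
            weight += math.log2((1.0 - m_k) / (1.0 - u_k))
    return weight

def match_probability(weight, prior):
    """Posterior match probability from a weight and a prior match rate."""
    odds = (prior / (1.0 - prior)) * 2.0 ** weight
    return odds / (1.0 + odds)

# Hypothetical agreement probabilities for three comparison fields.
m = [0.95, 0.90, 0.85]
u = [0.01, 0.05, 0.10]

w_all = match_weight([True, True, True], m, u)      # agrees on every field
w_none = match_weight([False, False, False], m, u)  # disagrees on every field
p_all = match_probability(w_all, prior=0.001)
p_none = match_probability(w_none, prior=0.001)
```

Because the weight is a sum of per-field terms, any dependence between fields (for example, surname and address changing together) is ignored, which is one way the resulting probabilities and error estimates can become unreliable.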
Taxonomy
Topics: Data Quality and Management · Advanced Database Systems and Queries · Distributed systems and fault tolerance
