A Prior for Record Linkage Based on Allelic Partitions
Brenda Betancourt, Juan Sosa, Abel Rodríguez

TL;DR
This paper introduces a new prior distribution based on allelic partitions tailored for record linkage, effectively modeling small clusters and incorporating prior size information, with demonstrated competitive performance on real datasets.
Contribution
The paper proposes a novel class of microclustering priors based on allelic partitions, specifically designed for record linkage with small clusters, and introduces new constraints for cluster size control.
Findings
The proposed priors perform competitively against state-of-the-art models on real datasets.
The approach effectively models the microclustering property in record linkage.
Different loss functions for partition estimation are compared and evaluated.
Abstract
In database management, record linkage aims to identify multiple records that correspond to the same individual. This task can be treated as a clustering problem, in which a latent entity is associated with one or more noisy database records. However, in contrast to traditional clustering applications, a large number of clusters with a few observations per cluster is expected in this context. In this paper, we introduce a new class of prior distributions based on allelic partitions that is specially suited for the small cluster setting of record linkage. Our approach makes it straightforward to introduce prior information about the cluster size distribution at different scales, and naturally enforces sublinear growth of the maximum cluster size, known as the microclustering property. We also introduce a set of novel microclustering conditions in order to impose further constraints on…
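To make the terminology in the abstract concrete, the following minimal sketch computes the allelic partition of a clustering, that is, the number of clusters of each size, along with the share of records in the largest cluster. It illustrates the definitions only and is not the authors' prior or sampler; the function names and the toy labels are hypothetical.

```python
from collections import Counter


def allelic_partition(labels):
    """Map each cluster size j to r_j, the number of clusters of that size.

    `labels` holds one cluster label per record (hypothetical input format).
    """
    cluster_sizes = Counter(labels)                # label -> size of that cluster
    size_counts = Counter(cluster_sizes.values())  # size j -> number of clusters r_j
    return dict(sorted(size_counts.items()))


def max_cluster_fraction(labels):
    """Share of all records that fall in the largest cluster."""
    sizes = Counter(labels).values()
    return max(sizes) / len(labels)


# Toy example: 7 records linked to 5 latent entities.
labels = ["a", "b", "a", "c", "d", "d", "e"]
r = allelic_partition(labels)

# The counts must account for every record: sum_j j * r_j == n.
total_records = sum(size * count for size, count in r.items())

print(r)                             # sizes 1 and 2 with their cluster counts
print(total_records)                 # equals len(labels)
print(max_cluster_fraction(labels))  # largest cluster size divided by n
```

In a record linkage setting most of the mass of the allelic partition is expected to sit at small sizes, and the microclustering property requires the largest cluster's share of the records to vanish as the number of records grows.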
Taxonomy
Topics: Data Quality and Management · Privacy-Preserving Technologies in Data · Data Management and Algorithms
