A Bayesian Approach for De-duplication in the Presence of Relational Data
Juan Sosa, Abel Rodriguez

TL;DR
This paper introduces a Bayesian method for de-duplication that combines profile and network data, evaluates the impact of priors, and explores stochastic gradient Hamiltonian Monte Carlo for efficient sampling, tested on RLdata500.
Contribution
It presents a novel Bayesian approach that integrates profile and network data for de-duplication, and it explores stochastic gradient Hamiltonian Monte Carlo as a faster way to sample from the posterior distribution of the network parameters.
Findings
Studies the impact of combining profile and network data on de-duplication.
Assesses the influence of a range of prior distributions on the linkage structure.
Explores stochastic gradient Hamiltonian Monte Carlo as a faster alternative for sampling the posterior distribution of the network parameters.
Abstract
In this paper, we study the impact of combining profile and network data in a de-duplication setting. We also assess the influence of a range of prior distributions on the linkage structure. Furthermore, we explore stochastic gradient Hamiltonian Monte Carlo methods as a faster alternative to obtain samples from the posterior distribution for network parameters. Our methodology is evaluated using the RLdata500 data, which is a popular dataset in the record linkage literature.
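To make the sampling idea concrete, here is a minimal sketch of a stochastic gradient Hamiltonian Monte Carlo update on a toy problem. It is not the authors' network model: the target is the posterior of a Gaussian mean with known unit variance, chosen only because the exact posterior is available for comparison. The function name `sghmc_gaussian_mean`, the step size, the friction value, and the batch size are all illustrative assumptions, and the gradient-noise estimate is simply set to zero.

```python
import numpy as np

def sghmc_gaussian_mean(x, n_iter=6000, burn_in=1000, batch_size=100,
                        step=1e-3, friction=50.0, prior_sd=10.0, seed=0):
    """Toy SGHMC sampler for mu in x_i ~ N(mu, 1), with prior mu ~ N(0, prior_sd^2).

    Hypothetical helper for illustration only; unit mass, gradient-noise
    estimate assumed to be zero.
    """
    rng = np.random.default_rng(seed)
    n = x.size
    theta, v = 0.0, 0.0
    draws = []
    for t in range(n_iter):
        # Minibatch estimate of the gradient of U(theta) = -log posterior.
        batch = rng.choice(x, size=batch_size, replace=False)
        grad_u = -(n / batch_size) * np.sum(batch - theta) + theta / prior_sd**2
        # Momentum update: gradient step, friction, and injected Gaussian noise.
        noise = rng.normal(0.0, np.sqrt(2.0 * friction * step))
        v += -step * grad_u - step * friction * v + noise
        # Position update.
        theta += step * v
        if t >= burn_in:
            draws.append(theta)
    return np.array(draws)

# Usage on simulated data (assumed true mean 2.0), compared with the exact posterior mean.
data_rng = np.random.default_rng(1)
x = data_rng.normal(2.0, 1.0, size=1000)
draws = sghmc_gaussian_mean(x)
exact_mean = x.sum() / (x.size + 1.0 / 10.0**2)
print(draws.mean(), exact_mean)
```

Each iteration touches only `batch_size` observations rather than the full data set, which is the source of the speed-up over full-gradient Hamiltonian Monte Carlo. The friction term offsets the extra noise introduced by the minibatch gradient, so the sample spread is only approximately correct when that noise is ignored, as it is here.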
Taxonomy
Topics: Data Quality and Management · Privacy-Preserving Technologies in Data
