A Practical Approach to Proper Inference with Linked Data
Andee Kaplan, Brenda Betancourt, and Rebecca C. Steorts

TL;DR
This paper introduces scalable Bayesian methods for canonicalization in entity resolution, improving the propagation of uncertainty to downstream inference tasks like regression, demonstrated on simulated and real voter data.
Contribution
The paper proposes five scalable Bayesian canonicalization methods that carry entity resolution (ER) uncertainty into downstream analysis, with the aim of more accurate inference.
Findings
Bayesian canonicalization improves downstream inference accuracy.
Methods are scalable and applicable to general data scenarios.
Empirical evaluation shows better prediction and coverage in regression tasks.
Abstract
Entity resolution (ER), comprising record linkage and de-duplication, is the process of merging noisy databases in the absence of unique identifiers to remove duplicate entities. One major challenge of analysis with linked data is identifying a representative record among determined matches to pass to an inferential or predictive task, referred to as the downstream task. Additionally, incorporating uncertainty from ER in the downstream task is critical to ensure proper inference. To bridge the gap between ER and the downstream task in an analysis pipeline, we propose five methods to choose a representative (or canonical) record from linked data, referred to as canonicalization. Our methods are scalable in the number of records, appropriate in general data scenarios, and provide natural error propagation via a Bayesian canonicalization stage. The proposed methodology is evaluated…
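To make the pipeline in the abstract concrete, here is a minimal sketch of how ER uncertainty might be passed to a regression. It is not one of the paper's five methods, which this summary does not spell out. It assumes posterior samples of the linkage structure are already available as arrays of cluster labels, picks one representative record per cluster uniformly at random (a stand-in canonicalization rule), and refits ordinary least squares for each sample. The function names `canonicalize_random` and `downstream_ols` and the toy data are hypothetical.

```python
import numpy as np


def canonicalize_random(cluster_labels, rng):
    """Pick one record index uniformly at random from each cluster.

    Assumption: random selection is only a placeholder rule for choosing
    a canonical record; it is not taken from the paper.
    """
    labels = np.asarray(cluster_labels)
    reps = []
    for c in np.unique(labels):
        members = np.flatnonzero(labels == c)  # record indices in cluster c
        reps.append(rng.choice(members))
    return np.array(reps)


def downstream_ols(X, y, linkage_samples, seed=0):
    """Fit OLS on the canonical records of each posterior linkage sample.

    Returns an array of shape (n_samples, n_features + 1); each row holds
    the intercept followed by the slopes for one linkage sample.
    """
    rng = np.random.default_rng(seed)
    coefs = []
    for labels in linkage_samples:
        idx = canonicalize_random(labels, rng)
        design = np.column_stack([np.ones(len(idx)), X[idx]])
        beta, *_ = np.linalg.lstsq(design, y[idx], rcond=None)
        coefs.append(beta)
    return np.array(coefs)


# Toy data (hypothetical): six noisy records describing four true entities.
rng = np.random.default_rng(42)
x_true = np.array([0.0, 1.0, 2.0, 3.0])
entity = np.array([0, 0, 1, 2, 2, 3])
X = (x_true[entity] + rng.normal(0, 0.05, size=6)).reshape(-1, 1)
y = 2.0 + 3.0 * x_true[entity] + rng.normal(0, 0.1, size=6)

# Hypothetical posterior samples of the linkage (one label per record).
linkage_samples = [
    np.array([0, 0, 1, 2, 2, 3]),
    np.array([0, 0, 1, 2, 3, 4]),
    np.array([0, 1, 2, 3, 3, 4]),
]

coefs = downstream_ols(X, y, linkage_samples)
print("mean coefficients:", coefs.mean(axis=0))
print("2.5% and 97.5% quantiles:", np.percentile(coefs, [2.5, 97.5], axis=0))
```

The spread of the coefficients across linkage samples reflects how much the downstream estimate depends on which records are treated as duplicates, which is the kind of error propagation the abstract describes.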
Taxonomy
Topics: Data Quality and Management · Privacy-Preserving Technologies in Data · Topic Modeling
Methods: Linear Regression
