In Search of an Entity Resolution OASIS: Optimal Asymptotic Sequential Importance Sampling
Neil G. Marchant, Benjamin I. P. Rubinstein

TL;DR
The paper introduces OASIS, an adaptive importance sampling algorithm for entity resolution (ER) evaluation that reduces labeling effort by up to 83% while still yielding statistically consistent estimates of performance metrics.
Contribution
OASIS is a novel sampling and estimation method that adaptively focuses on informative samples using a Bayesian model, improving efficiency in ER evaluation.
Findings
Achieves up to 83% reduction in labeling effort
Provides consistent estimates of F-measure, precision, and recall
Demonstrates superior performance over existing sampling methods
Abstract
Entity resolution (ER) presents unique challenges for evaluation methodology. While crowdsourcing platforms acquire ground truth, sound approaches to sampling must drive labelling efforts. In ER, extreme class imbalance between matching and non-matching records can lead to enormous labelling requirements when seeking statistically consistent estimates for rigorous evaluation. This paper addresses this important challenge with the OASIS algorithm: a sampler and F-measure estimator for ER evaluation. OASIS draws samples from a (biased) instrumental distribution, chosen to ensure estimators with optimal asymptotic variance. As new labels are collected OASIS updates this instrumental distribution via a Bayesian latent variable model of the annotator oracle, to quickly focus on unlabelled items providing more information. We prove that resulting estimates of F-measure, precision, recall…
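The core idea in the abstract, estimating F-measure from labels drawn under a biased instrumental distribution, can be illustrated with a minimal Python sketch. This is not the authors' OASIS implementation: it uses a fixed instrumental distribution (a uniform/score-proportional mixture chosen here purely for illustration) rather than the adaptive, asymptotically optimal one the paper describes, and it omits the Bayesian model of the oracle. The function name `estimate_f_measure`, its parameters, and the weighted form F_alpha = TP / (alpha (TP + FP) + (1 - alpha) (TP + FN)) are assumptions of this sketch.

```python
import random


def estimate_f_measure(predictions, scores, oracle, n_samples, alpha=0.5, seed=0):
    """Importance-weighted F-measure estimate over a pool of record pairs.

    predictions: list of 0/1 predicted match labels, one per pair.
    scores:      list of positive similarity scores (assumed > 0), one per pair.
    oracle:      callable i -> 0/1 true label (hypothetical stand-in for an annotator).
    alpha:       weight in F_alpha; alpha=0.5 gives the balanced F1 score.
    """
    rng = random.Random(seed)
    n = len(predictions)
    total = sum(scores)

    # Static instrumental distribution (an assumption of this sketch):
    # half uniform, half proportional to the similarity score, so every
    # item keeps a non-zero probability of being sampled.
    q = [0.5 / n + 0.5 * s / total for s in scores]
    p = 1.0 / n  # target distribution: uniform over the pool

    num = 0.0
    den = 0.0
    for i in rng.choices(range(n), weights=q, k=n_samples):
        w = p / q[i]          # importance weight corrects for the biased draw
        y = oracle(i)         # query the label (repeat draws re-query the oracle)
        y_hat = predictions[i]
        num += w * y * y_hat
        den += w * (alpha * y_hat + (1 - alpha) * y)

    # Ratio estimator; undefined if no predicted or true matches were sampled.
    return num / den if den > 0 else float("nan")
```

Because matching pairs are rare, biasing draws toward high-scoring pairs puts more of the labeling budget on the items that determine the numerator and denominator, while the importance weights keep the ratio estimate consistent. In practice one would cache oracle answers so that a pair drawn twice is labeled only once.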
Taxonomy
Topics: Data Quality and Management · Mobile Crowdsensing and Crowdsourcing · Privacy-Preserving Technologies in Data
