Query-Driven Sampling for Collective Entity Resolution
Christan Grant, Daisy Zhe Wang, Michael L. Wick

TL;DR
This paper introduces query-driven collective entity resolution techniques that enable faster, on-demand resolution of entities in large probabilistic databases, significantly reducing computation time for practical applications.
Contribution
It proposes novel SQL-based ER query classes and biased sampling algorithms to perform efficient, real-time collective entity resolution on large datasets.
Findings
Query-driven ER converges within minutes on large datasets.
Biased sampling improves efficiency of MCMC inference.
Selective ER provides accurate results with reduced computation.
Abstract
Probabilistic databases play a preeminent role in the processing and management of uncertain data. Recently, many database research efforts have integrated probabilistic models into databases to support tasks such as information extraction and labeling. Many of these efforts are based on batch-oriented inference, which inhibits a real-time workflow. One important task is entity resolution (ER). ER is the process of determining which records (mentions) in a database correspond to the same real-world entity. Traditional pairwise ER methods can lead to inconsistencies and low accuracy due to localized decisions. Leading ER systems solve this problem by collectively resolving all records using a probabilistic graphical model and Markov chain Monte Carlo (MCMC) inference. However, for large datasets this is an extremely expensive process. One key observation is that such exhaustive ER process…
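To make the idea of query-driven MCMC for collective ER concrete, here is a minimal Python sketch. It is not the authors' algorithm: the token-overlap `affinity` model, the `scale`, `threshold`, and `boost` parameters, and the function name `query_driven_er` are all assumptions made for illustration. The sampler is a plain Metropolis random walk over cluster labels in which the mention to move is drawn from fixed weights that favor mentions resembling the query mention. Because those weights do not depend on the current clustering, the proposal stays symmetric and the usual acceptance rule applies.

```python
import math
import random
import re


def tokens(name):
    """Lowercase alphabetic tokens of a mention string."""
    return set(re.findall(r"[a-z]+", name.lower()))


def jaccard(a, b):
    """Token-set Jaccard similarity (an assumed, simplistic similarity)."""
    ta, tb = tokens(a), tokens(b)
    union = ta | tb
    return len(ta & tb) / len(union) if union else 0.0


def affinity(a, b, scale=20.0, threshold=0.2):
    """Hypothetical pairwise factor: positive if similar, negative otherwise."""
    return scale * (jaccard(a, b) - threshold)


def score(mentions, labels):
    """Log-score of a clustering: sum of affinities within each cluster."""
    total = 0.0
    for i in range(len(mentions)):
        for j in range(i + 1, len(mentions)):
            if labels[i] == labels[j]:
                total += affinity(mentions[i], mentions[j])
    return total


def query_driven_er(mentions, query, steps=20000, boost=5.0, seed=0):
    """Estimate, for each mention, how often it shares a cluster with the query."""
    rng = random.Random(seed)
    n = len(mentions)
    labels = list(range(n))  # start with every mention in its own cluster
    # Fixed, state-independent bias toward mentions that resemble the query.
    weights = [1.0 + boost * jaccard(m, mentions[query]) for m in mentions]
    current = score(mentions, labels)
    together = [0] * n
    for _ in range(steps):
        i = rng.choices(range(n), weights=weights)[0]
        old = labels[i]
        labels[i] = rng.randrange(n)  # uniform label proposal (symmetric)
        proposed = score(mentions, labels)
        # Metropolis acceptance: always accept uphill moves, sometimes downhill.
        if rng.random() < math.exp(min(0.0, proposed - current)):
            current = proposed
        else:
            labels[i] = old
        for j in range(n):
            together[j] += labels[j] == labels[query]
    return [count / steps for count in together]


mentions = ["J. Smith", "John Smith", "Smith, John", "A. Kumar", "Anil Kumar"]
probs = query_driven_er(mentions, query=0)
for mention, p in zip(mentions, probs):
    print(f"{mention!r}: co-reference with query estimated at {p:.2f}")
```

In this toy setup the query is the mention at index 0, and the returned list approximates the probability that each other mention refers to the same entity. Recomputing the full score on every step is quadratic in the number of mentions; a practical sampler would update only the terms touched by the moved mention.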
