Probabilistic Blocking with An Application to the Syrian Conflict
Rebecca C. Steorts, Anshumali Shrivastava

TL;DR
This paper reviews modern blocking techniques for entity resolution, introduces new variants of locality sensitive hashing, and applies these methods to analyze data related to the Syrian conflict.
Contribution
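It introduces KLSH, brings Densified One Permutation Hashing (DOPH), a subquadratic variant of LSH, to the entity resolution literature, and proposes a weighted variant of DOPH, expanding the blocking toolkit with insights from a practical application.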
Findings
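KLSH clusters similar records into blocks using a vector-space representation and projections.
DOPH and its weighted variant provide subquadratic LSH-based blocking.
Applying each method to a subset of Syrian conflict data illustrates its practical use.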
Abstract
Entity resolution seeks to merge databases so as to remove duplicate entries where unique identifiers are typically unknown. We review modern blocking approaches for entity resolution, focusing on those based upon locality sensitive hashing (LSH). First, we introduce k-means locality sensitive hashing (KLSH), which is based upon the information retrieval literature and clusters similar records into blocks using a vector-space representation and projections. Second, we introduce a subquadratic variant of LSH to the literature, known as Densified One Permutation Hashing (DOPH). Third, we propose a weighted variant of DOPH. We illustrate each method on an application to a subset of data from the ongoing Syrian conflict, giving a discussion of each method.
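To make the blocking idea concrete, here is a minimal Python sketch of one-permutation hashing with a rotation-style densification step, followed by banding of the signature into blocks. It is an illustration under stated assumptions and not the authors' implementation: the character 2-gram shingling, the MD5-based hash standing in for a random permutation, the function names (`shingles`, `doph_signature`, `block_records`), the parameters `num_bins` and `band_size`, and the sample records are all hypothetical choices made for this example. The weighted variant is not shown.

```python
import hashlib
from collections import defaultdict


def shingles(text, k=2):
    """Return the set of character k-grams of a lowercased string."""
    s = text.lower()
    return {s[i:i + k] for i in range(len(s) - k + 1)}


def _hash32(token):
    """Deterministic 32-bit hash; an assumed stand-in for a random permutation."""
    digest = hashlib.md5(token.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big")


def doph_signature(tokens, num_bins=8):
    """One hash pass over the tokens, keeping the minimum hash value per bin.

    Empty bins are then filled ("densified") by borrowing from the nearest
    non-empty bin to the right, wrapping around, and recording the distance.
    """
    mins = [None] * num_bins
    for token in tokens:
        h = _hash32(token)
        b = (h * num_bins) >> 32  # bin index in 0 .. num_bins - 1
        if mins[b] is None or h < mins[b]:
            mins[b] = h
    if all(v is None for v in mins):
        raise ValueError("cannot hash an empty token set")

    dense = []
    for i in range(num_bins):
        j, steps = i, 0
        while mins[j] is None:  # terminates: at least one bin is non-empty
            j = (j + 1) % num_bins
            steps += 1
        dense.append((mins[j], steps))
    return dense


def block_records(records, num_bins=8, band_size=2):
    """Group record ids whose signatures agree on a whole band of bins."""
    blocks = defaultdict(set)
    for rid, text in records.items():
        sig = doph_signature(shingles(text), num_bins)
        for start in range(0, num_bins, band_size):
            key = (start, tuple(sig[start:start + band_size]))
            blocks[key].add(rid)
    # Only blocks with at least two records yield candidate pairs.
    return {key: ids for key, ids in blocks.items() if len(ids) > 1}


if __name__ == "__main__":
    # Invented sample records, for illustration only.
    sample = {
        "r1": "John Smith 1980",
        "r2": "john smith 1980",
        "r3": "Maria Garcia 1975",
    }
    for key, ids in block_records(sample).items():
        print(key[0], sorted(ids))
```

Records that differ only in letter case share the same shingle set, so they always land in the same blocks; near-duplicates collide on a band only with some probability, which is why several bands are used. Only pairs inside a shared block would then be passed to a more expensive comparison step.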
Taxonomy
Topics: Data Quality and Management · Algorithms and Data Compression · Web Data Mining and Analysis
