Unique Entity Estimation with Application to the Syrian Conflict
Beidi Chen, Anshumali Shrivastava, Rebecca C. Steorts

TL;DR
This paper introduces an efficient algorithm for estimating the number of unique entities in large, noisy datasets, demonstrated on Syrian conflict data, with applications in conflict analysis and data deduplication.
Contribution
It proposes a near-linear-time estimator based on locality-sensitive hashing that is unbiased and has low variance, improving on existing methods for large-scale unique entity estimation.
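As a rough illustration of why locality-sensitive hashing avoids all-to-all comparison, the Python sketch below buckets records by MinHash bands, compares only records that share a bucket, and counts the connected components among the matched pairs. This is not the paper's estimator: it is a simplified, deterministic stand-in with no unbiasedness guarantee and no standard error. The function names, the character-bigram tokenization, the banding parameters, and the 0.6 Jaccard threshold are all assumptions made for illustration.

```python
import hashlib
from collections import defaultdict
from itertools import combinations


def shingles(text, k=2):
    """Character k-grams of a lowercased string (assumed tokenization)."""
    s = text.lower()
    return {s[i:i + k] for i in range(max(1, len(s) - k + 1))}


def minhash_signature(tokens, num_hashes):
    """One minimum hash value per seeded hash function."""
    signature = []
    for seed in range(num_hashes):
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(f"{seed}:{t}".encode(), digest_size=8).digest(),
                "big",
            )
            for t in tokens
        ))
    return tuple(signature)


def jaccard(a, b):
    """Jaccard similarity of two non-empty sets."""
    return len(a & b) / len(a | b)


def count_entities(records, threshold=0.6, bands=8, rows=2):
    """Count connected components among LSH candidate pairs that pass the threshold."""
    token_sets = [shingles(r) for r in records]
    signatures = [minhash_signature(t, bands * rows) for t in token_sets]

    # Records whose signatures agree on an entire band land in the same bucket.
    buckets = defaultdict(list)
    for idx, sig in enumerate(signatures):
        for b in range(bands):
            buckets[(b, sig[b * rows:(b + 1) * rows])].append(idx)

    # Only records sharing at least one bucket are ever compared.
    candidates = set()
    for members in buckets.values():
        candidates.update(combinations(members, 2))

    # Union-find over candidate pairs whose Jaccard similarity passes the threshold.
    parent = list(range(len(records)))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for i, j in candidates:
        if jaccard(token_sets[i], token_sets[j]) >= threshold:
            parent[find(i)] = find(j)

    return len({find(i) for i in range(len(records))})


if __name__ == "__main__":
    # Hypothetical records: two exact duplicates and one unrelated name.
    print(count_entities(["ahmad khalil", "ahmad khalil", "yusuf doe"]))
```

Exact duplicates always share every bucket and are merged, while records with no bigrams in common are never merged regardless of bucketing. Near-duplicates are merged only if they collide in at least one band, which is the source of the speed-up and also of the missed pairs that a statistical estimator has to correct for.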
Findings
The estimator is unbiased and has low variance.
Empirical results show superiority over state-of-the-art methods.
Application to Syrian conflict data yields estimates close to expert assessments.
Abstract
Entity resolution identifies and removes duplicate entities in large, noisy databases and has grown in both usage and new developments as a result of increased data availability. Nevertheless, entity resolution has tradeoffs regarding assumptions of the data generation process, error rates, and computational scalability that make it a difficult task for real applications. In this paper, we focus on a related problem of unique entity estimation, which is the task of estimating the unique number of entities and associated standard errors in a data set with duplicate entities. Unique entity estimation shares many fundamental challenges of entity resolution, namely, that the computational cost of all-to-all entity comparisons is intractable for large databases. To circumvent this computational barrier, we propose an efficient (near-linear time) estimation algorithm based on locality…
Taxonomy
Topics: Data Quality and Management · Privacy-Preserving Technologies in Data · Data-Driven Disease Surveillance
