Estimating Omissions from Searches
Anthony J Webster, Richard Kemp

TL;DR
This paper exactly calculates the moments of the hypergeometric distribution that arises in mark-recapture (search estimation) problems, confirming that widely used estimates conjectured in 1951 are often too small, and shows how different search strategies modify the estimates.
Contribution
It introduces an exact calculation of the hypergeometric moments for search estimation and uses a Bayesian approach to explore how different search strategies modify the estimates.
Findings
Widely used estimates conjectured in 1951 are often too small.
Different search strategies significantly affect estimate accuracy.
The method can be used to assess the accuracy of a systematic literature review, an application the authors recommend.
Abstract
The mark-recapture method was devised by Petersen in 1896 to estimate the number of fish migrating into the Limfjord, and independently by Lincoln in 1930 to estimate waterfowl abundance. The technique applies to any search for a finite number of items by two or more people or agents, allowing the number of searched-for items to be estimated. This ubiquitous problem appears in fields from ecology and epidemiology, through to mathematics, social sciences, and computing. Here we exactly calculate the moments of the hypergeometric distribution associated with this long-standing problem, confirming that widely used estimates conjectured in 1951 are often too small. Our Bayesian approach highlights how different search strategies will modify the estimates. As an example, we assess the accuracy of a systematic literature review, an application we recommend.
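To make the two-searcher setting concrete, here is a minimal Python sketch of the classical estimates. It is not the authors' exact calculation or their Bayesian approach. The notation is an assumption introduced for illustration: `n1` and `n2` are the numbers of items found by each searcher, `m` is the number found by both, and `N` is the unknown total. The sketch assumes the two searches are independent and that every item is equally likely to be found. Chapman's bias-corrected variant, published in 1951, is included for comparison; treating it as the 1951 estimate the abstract refers to is also an assumption.

```python
from math import comb


def lincoln_petersen(n1: int, n2: int, m: int) -> float:
    """Classical Lincoln-Petersen estimate of the total number of items N."""
    if m == 0:
        raise ValueError("no overlap between searches: estimate is undefined")
    return n1 * n2 / m


def chapman(n1: int, n2: int, m: int) -> float:
    """Chapman's (1951) bias-corrected estimate; defined even when m == 0."""
    return (n1 + 1) * (n2 + 1) / (m + 1) - 1


def overlap_pmf(m: int, N: int, n1: int, n2: int) -> float:
    """Hypergeometric probability that the searches share exactly m items.

    Model (an assumption): searcher 2 finds n2 of the N items uniformly at
    random without replacement, and n1 of the N items were found by searcher 1.
    """
    return comb(n1, m) * comb(N - n1, n2 - m) / comb(N, n2)


# Hypothetical example: two reviewers find 40 and 30 papers, 24 in common.
n1, n2, m = 40, 30, 24
found = n1 + n2 - m                  # distinct items found by either searcher
lp = lincoln_petersen(n1, n2, m)     # 40 * 30 / 24
ch = chapman(n1, n2, m)              # 41 * 31 / 25 - 1
print("found by either searcher:", found)
print("Lincoln-Petersen total:", lp, "estimated omissions:", lp - found)
print("Chapman total:", ch, "estimated omissions:", ch - found)
```

The estimated number of omissions is the estimated total minus the number of distinct items actually found. Both functions return point estimates only; the moments of the hypergeometric distribution discussed in the abstract are what determine their bias and spread.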
