Unbiased Prevalence Estimation with Multicalibrated LLMs
Fridolin Linder, Thomas Leeper, Daniel Haimovich, Niek Tax, Lorenzo Perini, Milan Vojnovic

TL;DR
This paper shows that multicalibration is sufficient for unbiased prevalence estimation under covariate shift, a guarantee that standard calibration and quantification methods do not provide. The result is supported by theory, a simulation, and applications involving large language models.
Contribution
It shows that multicalibration, a concept from recent theoretical work on fairness, is sufficient for unbiased prevalence estimates under covariate shift, connecting that work to a measurement problem that spans many disciplines.
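A simplified sketch of why conditional calibration matters, and not the paper's theorem: the notation here is an assumption made for illustration. Suppose a score f(X), a finite partition of the population into groups G = g defined by input features, and a source distribution P and target distribution Q that differ only in their group shares, so that the within-group means of Y and f(X) are the same under both. If f is calibrated within every group on P, the mean score on the target equals the target prevalence:

```latex
\begin{align*}
\mathbb{E}_Q[f(X)]
  &= \sum_g Q(G=g)\,\mathbb{E}_P[f(X)\mid G=g] \\
  &= \sum_g Q(G=g)\,\mathbb{E}_P[Y\mid G=g] \\
  &= \mathbb{E}_Q[Y].
\end{align*}
```

Calibration on average only gives E_P[f(X)] = E_P[Y], which constrains nothing once the group shares under Q differ from those under P.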
Findings
Standard methods exhibit bias that grows with the magnitude of the covariate shift.
Multicalibrated estimators maintain near-zero bias under covariate shift.
Empirical applications show multicalibration reduces bias significantly.
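The pattern in these findings can be illustrated with a toy simulation. This is a minimal sketch under stated assumptions, not the paper's simulation and not an LLM experiment: the single binary feature `g`, the rates in `P_Y_GIVEN_G`, and the population shares are all hypothetical. It compares a score that is calibrated only on average in the source population with one that is calibrated within each group, and estimates prevalence as the mean score on a shifted target population.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: one binary feature g, with P(Y=1 | g) held fixed
# across populations (covariate shift changes only the share of g).
P_Y_GIVEN_G = np.array([0.1, 0.6])
SOURCE_SHARE_G1 = 0.3  # share of g == 1 in the source population


def sample_population(share_g1, n):
    """Draw n units with the given share of g == 1 and labels from P(Y | g)."""
    g = (rng.random(n) < share_g1).astype(int)
    y = (rng.random(n) < P_Y_GIVEN_G[g]).astype(int)
    return g, y


# "Fit" both scores on labeled source data.
g_src, y_src = sample_population(SOURCE_SHARE_G1, 200_000)
marginal_score = y_src.mean()  # one constant score: calibrated on average
group_scores = np.array([y_src[g_src == k].mean() for k in (0, 1)])  # per group


def estimate_bias(share_g1, n=200_000):
    """Return (bias of average-calibrated, bias of group-calibrated) estimate."""
    g, y = sample_population(share_g1, n)
    true_prevalence = y.mean()
    bias_marginal = marginal_score - true_prevalence
    bias_group = group_scores[g].mean() - true_prevalence
    return bias_marginal, bias_group


# Increase the target share of g == 1 to increase the shift.
results = {s: estimate_bias(s) for s in (0.3, 0.5, 0.7, 0.9)}
for share, (b_marg, b_group) in results.items():
    print(f"target share of g=1: {share:.1f}  "
          f"bias (average-calibrated): {b_marg:+.3f}  "
          f"bias (group-calibrated): {b_group:+.3f}")
```

In this toy setting the group-calibrated estimate stays close to the true prevalence because the shift acts only through the feature it conditions on; a shift along a feature the score is not calibrated on would not be covered.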
Abstract
Estimating the prevalence of a category in a population using imperfect measurement devices (diagnostic tests, classifiers, or large language models) is fundamental to science, public health, and online trust and safety. Standard approaches correct for known device error rates but assume these rates remain stable across populations. We show this assumption fails under covariate shift and that multicalibration, which enforces calibration conditional on the input features rather than just on average, is sufficient for unbiased prevalence estimation under such shift. Standard calibration and quantification methods fail to provide this guarantee. Our work connects recent theoretical work on fairness to a longstanding measurement problem spanning nearly all academic disciplines. A simulation confirms that standard methods exhibit bias growing with shift magnitude, while a multicalibrated…
