Toward design-based inference for data integration
Andrius Čiginas, Ieva Burakauskaitė, Jae Kwang Kim

TL;DR
This paper introduces a design-based framework for integrating non-probability samples into finite-population inference, avoiding assumptions about the selection mechanism and providing consistent estimators.
Contribution
It develops two generalized regression estimators that are design-consistent under any selection mechanism, including NMAR, and proposes a diagnostic test for estimator choice.
Findings
The proposed estimators remain design-consistent under both MAR and NMAR selection.
Propensity-adjusted methods can be biased under NMAR.
Separate regression performs better when the strata are heterogeneous; combined regression is more efficient when the strata are similar.
Abstract
Integrating non-probability samples into finite-population inference typically requires modeling unknown selection probabilities under a missing-at-random (MAR) assumption that is difficult to verify. We propose a design-based alternative in which the non-probability sample is treated as a fully observed certainty stratum and a probability sample is drawn only from the complementary, previously unsampled units. Within this sequential framework, we develop two generalized regression estimators: one fits the outcome model separately in the complementary stratum, and the other pools both samples. We make two distinct contributions. First, both estimators are design-consistent and admit consistent variance estimators with no assumption whatsoever on the non-probability selection mechanism, including under not-missing-at-random (NMAR) selection. Second, under a working superpopulation model…
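The sequential design described in the abstract can be illustrated with a small simulation. The following is a minimal sketch, not the authors' implementation: it assumes simple random sampling without replacement from the complementary stratum, a linear working model with a single covariate whose total over the complement is known, and a hypothetical helper name `separate_greg_total`. The population, the NMAR selection rule, and the sample sizes are all invented for illustration.

```python
import numpy as np

def separate_greg_total(y_A, x_B, y_B, x_total_C, N_C):
    """Sketch of a 'separate' regression estimator of the population total.

    y_A       : outcomes of the non-probability sample (certainty stratum, weight 1)
    x_B, y_B  : covariate and outcome for an SRS drawn from the complement C
    x_total_C : known total of x over the complement C (assumed available)
    N_C       : number of units in the complement C
    """
    n = len(y_B)
    d = N_C / n                                   # SRS design weight
    X = np.column_stack([np.ones(n), x_B])        # intercept + covariate
    # Under SRS all weights are equal, so weighted least squares reduces to OLS.
    beta, *_ = np.linalg.lstsq(X, y_B, rcond=None)
    ht_y = d * y_B.sum()                          # Horvitz-Thompson total of y in C
    ht_X = d * X.sum(axis=0)                      # Horvitz-Thompson totals of (1, x) in C
    X_tot = np.array([N_C, x_total_C])            # known totals of (1, x) in C
    greg_C = ht_y + (X_tot - ht_X) @ beta         # GREG estimate of the total over C
    return y_A.sum() + greg_C                     # certainty stratum is summed exactly

# Hypothetical finite population (illustration only).
rng = np.random.default_rng(0)
N = 10_000
x = rng.normal(size=N)
y = 2.0 + 1.5 * x + rng.normal(size=N)

# NMAR selection: inclusion in the non-probability sample depends on y itself.
p = 1.0 / (1.0 + np.exp(-(y - 3.0)))
in_A = rng.random(N) < p

# Probability sample: SRS of 500 units from the complement only.
C = np.flatnonzero(~in_A)
B = rng.choice(C, size=500, replace=False)

est = separate_greg_total(y[in_A], x[B], y[B], x[C].sum(), len(C))
true_total = y.sum()
print(f"estimate: {est:.1f}, true total: {true_total:.1f}")
```

The selection rule never enters the estimator: the non-probability units contribute their observed outcomes directly, and only the complement is estimated from the design weights. A "combined" variant would instead fit the regression coefficients on both samples pooled.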
