Nearly optimal capture-recapture sampling and empirical likelihood weighting estimation for M-estimation with big data
Yan Fan, Yang Liu, Yukun Liu, Jing Qin

TL;DR
This paper introduces a nearly optimal capture-recapture sampling plan combined with empirical likelihood weighting (ELW) for efficient M-estimation with big data. The approach avoids the instability of inverse probability weighting (IPW) and uses auxiliary information to improve accuracy.
Contribution
It develops a novel capture-recapture sampling strategy with empirical likelihood weighting that enhances estimation efficiency and reduces computational costs in big data analysis.
Findings
The ELW method yields more efficient estimators than IPW.
The proposed sampling plan balances estimation accuracy against computational cost.
Simulations and real data confirm the method's advantages.
Abstract
Subsampling techniques can reduce the computational costs of processing big data. Practical subsampling plans typically involve initial uniform sampling and refined sampling. With a subsample, big data inferences are generally built on the inverse probability weighting (IPW), which becomes unstable when the probability weights are close to zero and cannot incorporate auxiliary information. First, we consider capture-recapture sampling, which combines an initial uniform sampling with a second Poisson sampling. Under this sampling plan, we propose an empirical likelihood weighting (ELW) estimation approach to an M-estimation parameter. Second, based on the ELW method, we construct a nearly optimal capture-recapture sampling plan that balances estimation efficiency and computational costs. Third, we derive methods for determining the smallest sample sizes with which the proposed…
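To make the sampling plan in the abstract concrete, here is a minimal Python sketch of a two-stage draw (a uniform pilot sample followed by an independent Poisson sample) together with the plain IPW estimate of a mean. It is an illustration under stated assumptions, not the authors' implementation: the function names, the pilot size, the without-replacement pilot draw, and the inclusion probabilities are all hypothetical, and the paper's ELW estimator is not reproduced here.

```python
import numpy as np


def capture_recapture_sample(n, n0, probs, rng):
    """Draw a uniform pilot sample, then an independent Poisson sample.

    n0 and probs are illustrative inputs, not values taken from the paper.
    """
    # First capture: n0 units drawn uniformly (assumed without replacement).
    pilot = rng.choice(n, size=n0, replace=False)
    # Second capture: unit i enters independently with probability probs[i].
    poisson = np.flatnonzero(rng.random(n) < probs)
    return pilot, poisson


def ipw_mean(y, idx, probs):
    """IPW (Horvitz-Thompson) estimate of the full-data mean from sampled indices."""
    return np.sum(y[idx] / probs[idx]) / len(y)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n = 100_000
    y = rng.exponential(size=n)
    # Hypothetical inclusion probabilities: larger values are sampled more often.
    probs = np.clip(2_000 * y / y.sum(), 1e-6, 1.0)
    pilot, poisson = capture_recapture_sample(n, 500, probs, rng)
    print("pilot size:", pilot.size, "Poisson size:", poisson.size)
    print("IPW mean:", ipw_mean(y, poisson, probs), "full mean:", y.mean())
    # The largest weight shows how a near-zero probability can dominate the sum.
    print("largest weight:", (1.0 / probs[poisson]).max())
```

The weights `1 / probs` grow without bound as a sampled unit's probability approaches zero, which is the IPW instability the abstract describes. According to the abstract, the ELW approach is proposed as the alternative to these raw weights.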
Taxonomy
Topics: Census and Population Estimation · Statistical Methods and Bayesian Inference · Data-Driven Disease Surveillance
