Optimal Sampling for Generalized Linear Model under Measurement Constraint with Surrogate Variables
Yixin Shen, Yang Ning

TL;DR
This paper introduces an optimal sampling method for generalized linear models that leverages surrogate variables with measurement errors, improving estimation efficiency under data labeling constraints.
Contribution
The paper develops a sampling strategy that combines surrogate variables with the A-optimality criterion, achieving lower asymptotic variance than existing optimal sampling algorithms that do not use surrogate information.
Findings
Outperforms existing sampling algorithms in empirical mean squared error
Provides consistent estimators under measurement constraints
Enhances robustness in practical scenarios
Abstract
Measurement-constrained datasets, often encountered in semi-supervised learning, arise when data labeling is costly, time-intensive, or hindered by confidentiality or ethical concerns, resulting in a scarcity of labeled data. In certain cases, surrogate variables are accessible across the entire dataset and can serve as approximations to the true response variable; however, these surrogates often contain measurement errors and thus cannot be directly used for accurate prediction. We propose an optimal sampling strategy that effectively harnesses the available information from surrogate variables. This approach provides consistent estimators under the assumption of a generalized linear model, achieving theoretically lower asymptotic variance than existing optimal sampling algorithms that do not use surrogate data information. By employing the A-optimality criterion from optimal…
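To make the idea in the abstract concrete, here is a minimal sketch for the logistic-regression case. It is not the authors' algorithm: the score `|s_i - p_i| * ||M^{-1} x_i||`, the use of a surrogate-based pilot fit, the 10% label-flip noise model, and all function and variable names are assumptions chosen for illustration. What it does show is the general pattern: surrogates (available for every row) drive the sampling probabilities, true labels are queried only for the sampled rows, and inverse-probability weights keep the final estimator consistent.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def weighted_logistic_fit(X, y, w, n_iter=50, tol=1e-10):
    """Newton's method for a weighted logistic log-likelihood."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = sigmoid(X @ beta)
        grad = X.T @ (w * (y - p))
        hess = (X * (w * p * (1.0 - p))[:, None]).T @ X
        step = np.linalg.solve(hess, grad)
        beta = beta + step
        if np.max(np.abs(step)) < tol:
            break
    return beta

def surrogate_sampling_probs(X, s, beta_pilot):
    """Hypothetical A-optimality-style probabilities (an assumption, not the
    paper's formula): the surrogate s stands in for the unobserved label."""
    p = sigmoid(X @ beta_pilot)
    # Average information matrix evaluated at the pilot estimate.
    M = (X * (p * (1.0 - p))[:, None]).T @ X / X.shape[0]
    # Column i of solve(M, X.T) is M^{-1} x_i; take its Euclidean norm.
    norms = np.linalg.norm(np.linalg.solve(M, X.T), axis=0)
    score = np.abs(s - p) * norms + 1e-12  # small floor keeps every prob > 0
    return score / score.sum()

# Synthetic data: N rows with surrogates, budget of n true labels.
rng = np.random.default_rng(0)
N, n = 20000, 1000
X = np.column_stack([np.ones(N), rng.normal(size=(N, 2))])
beta_true = np.array([-0.5, 1.0, -1.0])
y = rng.binomial(1, sigmoid(X @ beta_true))   # true labels (costly to observe)
flip = rng.random(N) < 0.1                    # assumed noise: 10% of labels flipped
s = np.where(flip, 1 - y, y)                  # surrogate, observed for every row

# Pilot fit on surrogates only (biased, used just to shape the probabilities).
beta_pilot = weighted_logistic_fit(X, s.astype(float), np.ones(N))
pi = surrogate_sampling_probs(X, s, beta_pilot)

# Sample with replacement, query true labels for the sample, reweight by 1/pi.
idx = rng.choice(N, size=n, replace=True, p=pi)
w = 1.0 / (N * pi[idx])
beta_hat = weighted_logistic_fit(X[idx], y[idx].astype(float), w)
print("pilot (surrogate-only):", beta_pilot)
print("weighted subsample fit:", beta_hat)
```

The inverse-probability weights are what make the estimator valid for any choice of strictly positive sampling probabilities; the choice of score only affects its variance, which is where an optimality criterion such as A-optimality enters.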
Taxonomy
Topics: Advanced Statistical Methods and Models · Scientific Measurement and Uncertainty Evaluation · Advanced Statistical Process Monitoring
