Variable Selection for Linear Regression Imputation in Surveys
Ziming An, Mehdi Dagdoug, David Haziza

TL;DR
This paper studies variable selection for linear regression imputation in survey data. It defines an optimal imputation model and proposes a framework that yields valid confidence intervals after model selection, supported by theoretical guarantees and simulation studies.
Contribution
The paper introduces the notion of an optimal imputation model, analyzes the effects of model misspecification, and develops a method for constructing valid confidence intervals after model selection.
Findings
The optimal model coincides with the true model with high probability.
Misspecified models affect the consistency and variance of the estimator.
The proposed confidence intervals are asymptotically valid and optimal.
Abstract
Survey sampling is concerned with the estimation of finite population parameters. In practice, survey data suffer from item nonresponse, which is commonly handled through imputation, i.e., replacing missing values with predicted values. As a result, the properties of the resulting imputed estimator depend critically on the properties of the prediction method used. In turn, prediction methods themselves depend on the choice of variables and tuning parameters used to fit the imputation model. In this article, we study the problem of variable selection for linear regression imputation. Although variable selection has been widely studied across many fields, primarily for identification or prediction, its role in imputation for survey data has received comparatively little attention. We introduce the notion of an optimal imputation model defined through an oracle loss function and show that,…
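The idea of regression imputation described in the abstract can be illustrated with a minimal sketch. It is not the authors' procedure: the helper `regression_impute`, the simulated population, the response mechanism, and the use of equal survey weights are all assumptions made for illustration. The sketch fits ordinary least squares on respondents using a chosen subset of covariates, fills the missing values with predictions, and compares the imputed mean under two different variable choices.

```python
import numpy as np

def regression_impute(X, y, columns):
    """Fill missing y (NaN) with OLS predictions fitted on respondents.

    X: (n, p) covariate matrix, fully observed.
    y: (n,) survey variable, with NaN marking item nonresponse.
    columns: indices of the covariates used in the imputation model.
    """
    responded = ~np.isnan(y)
    # Design matrix: intercept plus the selected covariates only.
    Z = np.column_stack([np.ones(len(y)), X[:, columns]])
    beta, *_ = np.linalg.lstsq(Z[responded], y[responded], rcond=None)
    y_imputed = y.copy()
    y_imputed[~responded] = Z[~responded] @ beta
    return y_imputed

# Hypothetical data: y depends on the first two covariates; the third is noise.
rng = np.random.default_rng(0)
n = 2000
X = rng.normal(size=(n, 3))
y_full = 1.0 + 2.0 * X[:, 0] - 1.0 * X[:, 1] + rng.normal(scale=0.5, size=n)

# Assumed response mechanism: the response probability depends on X[:, 0].
p_respond = 1.0 / (1.0 + np.exp(-(0.5 + X[:, 0])))
y_obs = np.where(rng.random(n) < p_respond, y_full, np.nan)

# Imputed estimators of the mean (equal weights assumed for simplicity).
y_good = regression_impute(X, y_obs, [0, 1])  # includes the relevant covariates
y_bad = regression_impute(X, y_obs, [2])      # omits them
mean_full = y_full.mean()
est_good = y_good.mean()
est_bad = y_bad.mean()

print("mean without nonresponse:", mean_full)
print("imputed mean, covariates 0 and 1:", est_good)
print("imputed mean, covariate 2 only:", est_bad)
```

In this simulated setting the choice of covariates changes the imputed estimator noticeably, which is the kind of sensitivity that motivates studying variable selection for imputation.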
Taxonomy
Topics: Statistical Methods and Bayesian Inference · Survey Sampling and Estimation Techniques · Survey Methodology and Nonresponse
