Flexible variable selection in the presence of missing data
B. D. Williamson, Y. Huang

TL;DR
This paper introduces a nonparametric variable selection method, combined with multiple imputation, for identifying predictive feature panels when data are missing at random. It outperforms traditional penalized regression when the parametric model is misspecified.
Contribution
It proposes a novel nonparametric algorithm combined with multiple imputation for flexible variable selection under missing data, addressing limitations of model-based approaches.
Findings
Demonstrates improved classification performance over penalized regression methods when the regression model is misspecified.
Achieves control of error rates in variable selection.
Successfully applied to biomarker panel development for pancreatic cysts.
Abstract
In many applications, it is of interest to identify a parsimonious set of features, or panel, from multiple candidates that achieves a desired level of performance in predicting a response. This task is often complicated in practice by missing data arising from the sampling design or other random mechanisms. Most recent work on variable selection in missing data contexts relies in some part on a finite-dimensional statistical model, e.g., a generalized or penalized linear model. In cases where this model is misspecified, the selected variables may not all be truly scientifically relevant and can result in panels with suboptimal classification performance. To address this limitation, we propose a nonparametric variable selection algorithm combined with multiple imputation to develop flexible panels in the presence of missing-at-random data. We outline strategies based on the proposed…
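To make the general idea of pairing multiple imputation with model-free variable selection concrete, here is a minimal sketch. It is not the authors' algorithm: the hot-deck imputation, the marginal AUC importance score, the majority-vote rule, and every function name and threshold below are assumptions chosen for illustration. The marginal hot-deck draw is only appropriate when values are missing completely at random, which is a stronger assumption than the missing-at-random setting the abstract describes.

```python
import numpy as np

def hot_deck_impute(X, rng):
    """Fill each missing entry with a random draw from the observed values
    of the same column (a simple stand-in for a real imputation model)."""
    X_imp = X.copy()
    for j in range(X.shape[1]):
        missing = np.isnan(X[:, j])
        observed = X[~missing, j]
        X_imp[missing, j] = rng.choice(observed, size=missing.sum(), replace=True)
    return X_imp

def auc(x, y):
    """Rank-based AUC of a single feature for a binary outcome."""
    pos, neg = x[y == 1], x[y == 0]
    greater = (pos[:, None] > neg[None, :]).mean()
    ties = (pos[:, None] == neg[None, :]).mean()
    return greater + 0.5 * ties

def select_across_imputations(X, y, n_imputations=10, margin=0.1,
                              min_frac=0.5, seed=0):
    """Select features whose |AUC - 0.5| exceeds `margin` in at least
    `min_frac` of the imputed datasets (hypothetical pooling rule)."""
    rng = np.random.default_rng(seed)
    p = X.shape[1]
    votes = np.zeros(p)
    for _ in range(n_imputations):
        X_imp = hot_deck_impute(X, rng)
        scores = np.array([abs(auc(X_imp[:, j], y) - 0.5) for j in range(p)])
        votes += scores > margin
    return [j for j in range(p) if votes[j] / n_imputations >= min_frac]

# Toy data: only feature 0 carries signal; about 20% of entries are missing.
rng = np.random.default_rng(1)
n, p = 400, 5
y = rng.integers(0, 2, size=n)
X = rng.normal(size=(n, p))
X[:, 0] += 1.5 * y
X[rng.random((n, p)) < 0.2] = np.nan

selected = select_across_imputations(X, y)
print(selected)
```

Voting across imputed datasets is one simple way to carry imputation uncertainty into the selection step. The importance score and pooling rule can be swapped for any other model-free measure without changing the overall structure.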
Taxonomy
Topics: Statistical Methods and Inference · Statistical Methods and Bayesian Inference · Optimal Experimental Design Methods
