Variable selection via knockoffs in missing data settings with categorical predictors
Silvia Bacci, Emanuela Dreassi, Leonardo Grilli, Carla Rampichini

TL;DR
This paper extends the knockoffs method for variable selection to large-scale assessment data with missing values and categorical predictors. Missing values are handled by multiple imputation, and the method is applied to INVALSI test scores of Italian grade 5 students.
Contribution
It introduces a novel approach combining multiple imputation with knockoffs for variable selection in missing data settings with categorical variables.
Findings
In a simulation study, the method gives satisfactory results, consistent with those of a recently advocated cutting-edge method.
It is effective in real-world assessment data with missing and categorical variables.
The approach is flexible and feasible for complex multilevel data.
Abstract
Large-scale assessment data typically include numerous categorical variables, often affected by missing values. Motivated by the challenges arising in this framework, we extend the knockoffs method for selecting predictors to settings with missing values. Our proposal relies on a preliminary phase consisting of multiple imputations of missing values. Each imputed dataset is then processed using a suitable knockoff filter. We evaluate the performance of the proposed method through a simulation study, showing satisfactory results consistent with a recently advocated cutting-edge method. We apply the method to large-scale assessment data collected by INVALSI about test scores of Italian students in grade 5 with many background variables. This case study is challenging, as most predictors have unordered categories, a setting not taken into account by traditional knockoffs methods. In…
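The two-phase procedure described in the abstract (impute several times, then run a knockoff filter on each imputed dataset) can be illustrated with a minimal Python sketch. It assumes that the imputation step and the computation of the knockoff statistics W (one vector per imputed dataset) have already been done elsewhere. The threshold is the standard knockoff+ rule. The abstract does not say how selections are combined across imputed datasets, so the `aggregate` function and its `min_share` parameter are hypothetical illustrations, not the authors' rule.

```python
import numpy as np

def knockoff_threshold(W, q=0.1, offset=1):
    """Standard knockoff+ threshold (offset=1): the smallest t such that
    (offset + #{j: W_j <= -t}) / max(1, #{j: W_j >= t}) <= q."""
    W = np.asarray(W, dtype=float)
    candidates = np.sort(np.abs(W[W != 0]))
    for t in candidates:
        fdp_hat = (offset + np.sum(W <= -t)) / max(1, np.sum(W >= t))
        if fdp_hat <= q:
            return t
    return np.inf  # no threshold meets the target level: select nothing

def select_per_imputation(W_list, q=0.1):
    """Run the filter separately on the statistics of each imputed dataset.
    Returns a boolean array of shape (n_imputations, n_predictors)."""
    selections = []
    for W in W_list:
        W = np.asarray(W, dtype=float)
        selections.append(W >= knockoff_threshold(W, q))
    return np.array(selections)

def aggregate(selections, min_share=0.5):
    """HYPOTHETICAL combination rule (an assumption, not from the paper):
    keep a predictor if it is selected in at least min_share of the
    imputed datasets."""
    return selections.mean(axis=0) >= min_share
```

A call such as `aggregate(select_per_imputation(W_list, q=0.1))` returns one boolean per predictor. With unordered categorical predictors, a predictor would typically correspond to a group of dummy variables, which this sketch does not model.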
