Which Imputation Fits Which Feature Selection Method? A Survey-Based Simulation Study
Jakob Schwerter (1), Andrés Romero (1), Florian Dumpert (2), Markus Pauly (1,3) ((1) TU Dortmund University, Dortmund, Germany, (2) Federal Statistical Office of Germany, Wiesbaden, Germany, (3) Research Center Trustworthy Data Science and Security, Dortmund, Germany)

TL;DR
This study evaluates how different imputation methods affect feature importance and selection in tree-based and linear models using a survey-based simulation, providing guidance for choosing imputation techniques in practice.
Contribution
It systematically compares eight imputation methods across three learning algorithms to determine their impact on feature importance in survey data analysis.
Findings
Imputation choice significantly influences feature importance results.
Certain imputation methods better preserve feature importance for tree-based models.
The study offers practical recommendations for imputation in survey data analysis.
Abstract
Tree-based learning methods such as Random Forest and XGBoost are still the gold-standard prediction methods for tabular data. Feature importance measures are commonly used both for feature selection and for assessing the effect of features on the outcome variables in the model. This also applies to survey data, which are frequently encountered in the social sciences and official statistics. Such datasets often contain missing values. The typical solution is to impute the missing data before applying the learning method. However, given the large number of available imputation methods, the question arises as to which should be chosen to achieve the 'best' reflection of feature importance and feature selection in subsequent analyses. In the present paper, we investigate this question in a survey-based simulation study for eight state-of-the-art…
Taxonomy
Topics: Survey Methodology and Nonresponse
Methods: Feature Selection
