Confidence Intervals for Random Forest Permutation Importance with Missing Data
Nico Föge, Markus Pauly

TL;DR
This paper examines the impact of imputation methods on the validity of permutation importance confidence intervals in Random Forests with missing data, proposing an adjustment to improve coverage.
Contribution
It introduces an adaptation of Rubin's rule to improve confidence interval coverage for permutation importance in incomplete data scenarios.
Findings
Single imputation leads to low CI coverage.
Rubin's rule improves CI coverage when aggregating over multiple imputations.
Adjusted CIs achieve coverage closer to the nominal level in simulations and on real data.
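To make the aggregation step in the findings concrete, the following is a minimal sketch of the standard (textbook) Rubin's rules for pooling one feature's importance estimate across several imputed datasets. It is not the paper's adapted rule or its adjustment, which this summary does not spell out. The function name `pool_rubin`, its parameters, and the use of a normal quantile (rather than a t reference distribution) are assumptions made for illustration.

```python
from math import sqrt
from statistics import NormalDist, mean, variance


def pool_rubin(estimates, variances, level=0.95):
    """Pool per-imputation estimates with the standard Rubin's rules.

    estimates: importance estimate of one feature from each imputed dataset
    variances: estimated variance of that estimate in each imputed dataset
    Returns (pooled estimate, total variance, (lower, upper) interval).

    Assumption: the interval uses a normal quantile as a large-sample
    simplification; this is NOT the adjustment proposed in the paper.
    """
    m = len(estimates)
    if m < 2 or m != len(variances):
        raise ValueError("need at least two imputations with matching variances")

    pooled = mean(estimates)           # average of the m point estimates
    within = mean(variances)           # average within-imputation variance
    between = variance(estimates)      # between-imputation variance (divisor m - 1)
    total = within + (1 + 1 / m) * between

    z = NormalDist().inv_cdf(0.5 + level / 2)
    half_width = z * sqrt(total)
    return pooled, total, (pooled - half_width, pooled + half_width)


if __name__ == "__main__":
    # Hypothetical numbers for one feature across three imputed datasets.
    print(pool_rubin([1.0, 2.0, 3.0], [0.5, 0.5, 0.5]))
```

The between-imputation term is what a single-imputation analysis leaves out: with only one imputed dataset, the uncertainty caused by the missing values never enters the variance, which is consistent with the low coverage reported above.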
Abstract
Random Forests are renowned for their predictive accuracy, but valid inference, particularly about permutation-based feature importances, remains challenging. Existing methods, such as the confidence intervals (CIs) from Ishwaran et al. (2019), are promising but assume complete feature observation. However, real-world data often contains missing values. In this paper, we investigate how common imputation techniques affect the validity of Random Forest permutation-importance CIs when data are incomplete. Through an extensive simulation and real-world benchmark study, we compare state-of-the-art imputation methods across various missing-data mechanisms and missing rates. Our results show that single-imputation strategies lead to low CI coverage. As a remedy, we adapt Rubin's rule to aggregate feature-importance estimates and their variances over several imputed datasets and account for…
Taxonomy
Topics: Statistical Methods and Inference · Machine Learning in Healthcare · Imbalanced Data Classification Techniques
