Confidence Intervals for Random Forest Permutation Importance with Missing Data

Nico Föge; Markus Pauly

arXiv:2507.13918 · stat.ME · July 21, 2025 · Comput. Stat.

Open Access

TL;DR

This paper examines the impact of imputation methods on the validity of permutation importance confidence intervals in Random Forests with missing data, proposing an adjustment to improve coverage.

Contribution

It introduces an adaptation of Rubin's rule to improve confidence interval coverage for permutation importance in incomplete data scenarios.

Findings

01. Single imputation leads to low CI coverage.

02. Aggregating over multiple imputations with Rubin's rule improves CI coverage.

03. The adjusted CIs achieve coverage closer to the nominal level in simulations and on real data.
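
The aggregation over multiple imputations can be illustrated with the standard, textbook form of Rubin's rule. The sketch below is not the paper's adapted procedure, whose details are not given on this page; it only shows the generic pooling of per-imputation estimates and variances. The function name `pool_rubin`, its arguments, and the use of a normal quantile for the interval are assumptions made for illustration (Rubin's rule is usually paired with a t reference distribution using the returned degrees of freedom).

```python
from math import sqrt
from statistics import NormalDist, mean, variance


def pool_rubin(estimates, variances, level=0.95):
    """Pool m per-imputation estimates with the standard Rubin's rule.

    estimates: one importance estimate per imputed dataset (hypothetical input)
    variances: the matching within-imputation variance estimates
    Illustrative only: this is the textbook rule, not the paper's adaptation.
    """
    m = len(estimates)
    if m < 2 or len(variances) != m:
        raise ValueError("need at least two imputations and matching lengths")

    q_bar = mean(estimates)      # pooled point estimate
    u_bar = mean(variances)      # average within-imputation variance
    b = variance(estimates)      # between-imputation variance (m - 1 denominator)
    total = u_bar + (1 + 1 / m) * b  # total variance

    # Rubin's degrees of freedom; infinite when the estimates do not vary.
    if b > 0:
        df = (m - 1) * (1 + u_bar / ((1 + 1 / m) * b)) ** 2
    else:
        df = float("inf")

    # Simplification: a normal quantile stands in for the t quantile with df.
    z = NormalDist().inv_cdf(0.5 + level / 2)
    half_width = z * sqrt(total)
    return {
        "estimate": q_bar,
        "total_variance": total,
        "df": df,
        "ci": (q_bar - half_width, q_bar + half_width),
    }
```

Calling `pool_rubin` with the importance estimates and variance estimates obtained from each imputed dataset returns a single pooled estimate and interval. The between-imputation term is what a single-imputation analysis leaves out, which is consistent with the low coverage reported in finding 01.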

Abstract

Random Forests are renowned for their predictive accuracy, but valid inference, particularly about permutation-based feature importances, remains challenging. Existing methods, such as the confidence intervals (CIs) from Ishwaran et al. (2019), are promising but assume complete feature observation. However, real-world data often contains missing values. In this paper, we investigate how common imputation techniques affect the validity of Random Forest permutation-importance CIs when data are incomplete. Through an extensive simulation and real-world benchmark study, we compare state-of-the-art imputation methods across various missing-data mechanisms and missing rates. Our results show that single-imputation strategies lead to low CI coverage. As a remedy, we adapt Rubin's rule to aggregate feature-importance estimates and their variances over several imputed datasets and account for…

Taxonomy

Topics: Statistical Methods and Inference · Machine Learning in Healthcare · Imbalanced Data Classification Techniques