Benchmarking missing-values approaches for predictive models on health databases
Alexandre Perez-Lebel (MNI, MILA, PARIETAL), Gaël Varoquaux (MNI, MILA, PARIETAL), Marine Le Morvan (PARIETAL), Julie Josse (CRISAM, IDESP), Jean-Baptiste Poline (MNI)

TL;DR
This paper systematically benchmarks missing-value handling strategies for predictive models on large health databases, showing that native support for missing values in gradient-boosted trees often outperforms imputation in both accuracy and computational efficiency.
Contribution
It demonstrates that native missing-value support in gradient-boosted trees is superior to imputation in predictive accuracy and computational efficiency on large health datasets.
Findings
Native missing-value support outperforms imputation in accuracy.
Adding indicator variables improves imputation-based predictions.
Native support is faster and more robust than complex imputation methods.
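The second finding can be illustrated with a minimal sketch of imputation combined with indicator variables. It assumes mean imputation and a plain list-of-lists data layout; the helper name `impute_mean_with_indicator` is hypothetical and is not taken from the paper's code.

```python
import math


def impute_mean_with_indicator(rows):
    """Mean-impute each column and append one 0/1 missingness flag per column.

    `rows` is a list of equal-length lists of floats, where float("nan")
    marks a missing entry. Hypothetical helper, for illustration only.
    """
    n_cols = len(rows[0])
    means = []
    for j in range(n_cols):
        observed = [r[j] for r in rows if not math.isnan(r[j])]
        # Assumption: fall back to 0.0 when a column has no observed value.
        means.append(sum(observed) / len(observed) if observed else 0.0)

    augmented = []
    for r in rows:
        # Replace each missing entry with the mean of its column.
        filled = [means[j] if math.isnan(v) else v for j, v in enumerate(r)]
        # Record where the values were missing, so a downstream model
        # can still use the missingness pattern itself as a feature.
        flags = [1.0 if math.isnan(v) else 0.0 for v in r]
        augmented.append(filled + flags)
    return augmented


nan = float("nan")
X = [[1.0, nan], [3.0, 4.0], [nan, 8.0]]
X_aug = impute_mean_with_indicator(X)
# X_aug == [[1.0, 6.0, 0.0, 1.0], [3.0, 4.0, 0.0, 0.0], [2.0, 8.0, 1.0, 0.0]]
```

In practice, scikit-learn's `SimpleImputer(add_indicator=True)` offers comparable behavior, and its `HistGradientBoostingClassifier` accepts NaN inputs directly, which is one example of the native support discussed above. Whether these match the exact configurations used in the benchmark is not stated in this summary.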
Abstract
BACKGROUND: As databases grow larger, it becomes harder to fully control their collection, and they frequently come with missing values: incomplete observations. These large databases are well suited to train machine-learning models, for instance for forecasting or to extract biomarkers in biomedical settings. Such predictive approaches can use discriminative -- rather than generative -- modeling, and thus open the door to new missing-values strategies. Yet existing empirical evaluations of strategies to handle missing values have focused on inferential statistics. RESULTS: Here we conduct a systematic benchmark of missing-values strategies in predictive models with a focus on large health databases: four electronic health record datasets, a population brain imaging one, a health survey and two intensive care ones. Using gradient-boosted trees, we compare native support for missing…
Taxonomy
Topics: Machine Learning in Healthcare · Statistical Methods and Bayesian Inference · Sepsis Diagnosis and Treatment
