The harms of class imbalance corrections for machine learning based prediction models: a simulation study
Alex Carriero, Kim Luijken, Anne de Hond, Karel GM Moons, Ben van Calster, Maarten van Smeden

TL;DR
This study uses simulations and a case study to show that correcting for class imbalance in machine learning models often harms calibration, leading to overestimated risks and unreliable predictions in healthcare applications.
Contribution
It provides evidence that class imbalance correction can negatively impact model calibration, challenging common practices in clinical risk prediction modeling.
Findings
Models developed without an imbalance correction were calibrated as well as or better than models developed with one.
Imbalance correction often caused risk over-estimation.
Re-calibration did not always fix miscalibration from imbalance correction.
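The over-estimation finding can be illustrated with a small sketch. This is not the study's simulation design: the data-generating model (one standard-normal predictor, intercept -2.5, slope 1, giving an event rate of roughly 10%), the sample sizes, the seed, and the helper names `simulate`, `fit_logistic`, and `random_oversample` are all assumptions made for illustration. It fits a plain logistic regression with and without random oversampling of the minority class, then compares the mean predicted risk on untouched test data with the observed event rate.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def simulate(n, rng, intercept=-2.5, slope=1.0):
    """Draw (x, y) pairs from a logistic model (illustrative parameters)."""
    data = []
    for _ in range(n):
        x = rng.gauss(0.0, 1.0)
        y = 1 if rng.random() < sigmoid(intercept + slope * x) else 0
        data.append((x, y))
    return data


def fit_logistic(data, iterations=25):
    """Fit intercept and slope by Newton-Raphson (unpenalized maximum likelihood)."""
    b0, b1 = 0.0, 0.0
    for _ in range(iterations):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for x, y in data:
            p = sigmoid(b0 + b1 * x)
            w = p * (1.0 - p)
            g0 += y - p            # gradient w.r.t. intercept
            g1 += (y - p) * x      # gradient w.r.t. slope
            h00 += w               # entries of the 2x2 information matrix
            h01 += w * x
            h11 += w * x * x
        det = h00 * h11 - h01 * h01
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det
    return b0, b1


def random_oversample(data, rng):
    """Duplicate minority-class rows at random until both classes are equal in size."""
    events = [d for d in data if d[1] == 1]
    non_events = [d for d in data if d[1] == 0]
    minority, majority = sorted([events, non_events], key=len)
    extra = [rng.choice(minority) for _ in range(len(majority) - len(minority))]
    return data + extra


def mean_predicted_risk(model, data):
    b0, b1 = model
    return sum(sigmoid(b0 + b1 * x) for x, _ in data) / len(data)


rng = random.Random(2024)
train = simulate(5000, rng)
test = simulate(5000, rng)  # test data keep the original class imbalance

observed_rate = sum(y for _, y in test) / len(test)
risk_plain = mean_predicted_risk(fit_logistic(train), test)
risk_oversampled = mean_predicted_risk(
    fit_logistic(random_oversample(train, rng)), test
)

print(f"observed event rate:              {observed_rate:.3f}")
print(f"mean risk, no correction:         {risk_plain:.3f}")
print(f"mean risk, random oversampling:   {risk_oversampled:.3f}")
```

In this setup the uncorrected model's mean predicted risk should sit close to the observed event rate, while the oversampled model's should be far higher, because it was trained on data in which events make up half the sample. The sketch covers only logistic regression and one correction method; it does not speak to the other algorithms or to the re-calibration results summarized above.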
Abstract
Risk prediction models are increasingly used in healthcare to aid in clinical decision making. In most clinical contexts, model calibration (i.e., assessing the reliability of risk estimates) is critical. Data available for model development are often not perfectly balanced with respect to the modeled outcome (i.e., individuals with vs. without the event of interest are not equally represented in the data). It is common for researchers to correct this class imbalance, yet the effect of such imbalance corrections on the calibration of machine learning models is largely unknown. We studied the effect of imbalance corrections on model calibration for a variety of machine learning algorithms. Using extensive Monte Carlo simulations, we compared the out-of-sample predictive performance of models developed with an imbalance correction to those developed without a correction for class…
Taxonomy
Topics: Imbalanced Data Classification Techniques · Financial Distress and Bankruptcy Prediction
