Challenges learning from imbalanced data using tree-based models: Prevalence estimates systematically depend on hyperparameters and can be upwardly biased
Nathan Phelps, Daniel J. Lizotte, and Douglas G. Woolford

TL;DR
This paper investigates how hyperparameters and the majority-class sampling rate affect prevalence estimates from tree-based models in imbalanced binary classification. It shows that these estimates can be upwardly biased and that decision trees can be biased towards the minority class.
Contribution
It demonstrates that analytically calibrating a random forest trained on subsampled data can bias prevalence estimates, and it uncovers a surprising bias of decision trees towards the minority class.
Findings
Prevalence estimates depend on hyperparameters and sampling rate.
Calibrating random forests can lead to biased prevalence estimates.
Decision trees can be biased towards the minority class.
Abstract
When using machine learning for imbalanced binary classification problems, it is common to subsample the majority class to create a (more) balanced training dataset. This biases the model's predictions because the model learns from data whose data generating process differs from new data. One way of accounting for this bias is analytically mapping the resulting predictions to new values based on the sampling rate for the majority class. We show that calibrating a random forest this way has negative consequences, including prevalence estimates that depend on both the number of predictors considered at each split in the random forest and the sampling rate used. We explain the former using known properties of random forests and analytical calibration. Through investigating the latter issue, we made a surprising discovery - contrary to the widespread belief that decision trees are biased…
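The abstract does not state the analytical mapping itself. As a minimal sketch, assume the commonly used odds-based correction for majority-class undersampling: if a fraction `beta` of majority-class observations is kept, a prediction `p_s` from the model trained on the subsample maps to `beta * p_s / (beta * p_s + 1 - p_s)`. The function names and the symbol `beta` below are illustrative choices, not taken from the paper, and the prevalence estimate is assumed here to be the mean of the calibrated predictions.

```python
def calibrate(p_s: float, beta: float) -> float:
    """Map a probability learned on subsampled data back to the original scale.

    p_s  : predicted minority-class probability from the model trained on
           the subsampled (more balanced) data.
    beta : assumed fraction of majority-class observations kept, 0 < beta <= 1.
    """
    if not 0.0 < beta <= 1.0:
        raise ValueError("beta must be in (0, 1]")
    if not 0.0 <= p_s <= 1.0:
        raise ValueError("p_s must be in [0, 1]")
    # Subsampling the majority class inflates the odds by a factor of 1 / beta,
    # so multiplying the odds by beta undoes it.
    return beta * p_s / (beta * p_s + 1.0 - p_s)


def prevalence_estimate(predictions, beta: float) -> float:
    """Average the calibrated predictions to estimate minority-class prevalence."""
    calibrated = [calibrate(p, beta) for p in predictions]
    return sum(calibrated) / len(calibrated)


if __name__ == "__main__":
    # Hypothetical raw predictions from a model trained with 10% of the majority class.
    raw = [0.2, 0.5, 0.9]
    print(prevalence_estimate(raw, beta=0.1))
```

Because the mapping is nonlinear in `p_s`, averaging calibrated predictions is sensitive to how spread out the raw predictions are, which is one way model hyperparameters can end up influencing the resulting prevalence estimate.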
Taxonomy
Topics: Data-Driven Disease Surveillance
