Partial Resampling of Imbalanced Data
Firuz Kamalov, Amir F. Atiya, Dina Elreedy

TL;DR
This study examines how the sampling ratio affects classification accuracy on imbalanced datasets. Across 10 sampling methods and 20 datasets, it finds that the optimal ratio typically falls between 0.7 and 0.8 and identifies which dataset properties appear to shift it.
Contribution
A large-scale empirical analysis (10 sampling methods, 20 datasets) of how the sampling ratio affects classification on imbalanced data, identifying the best-performing ratio range and the dataset factors that influence it.
Findings
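The optimal sampling ratio is between 0.7 and 0.8, although the exact value varies by dataset.
The original imbalance ratio and the number of features play no discernible role in determining the optimal ratio.
The number of samples in the dataset may influence the optimal sampling ratio.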
Abstract
Imbalanced data is a frequently encountered problem in machine learning. Despite a vast amount of literature on sampling techniques for imbalanced data, there is a limited number of studies that address the issue of the optimal sampling ratio. In this paper, we attempt to fill the gap in the literature by conducting a large-scale study of the effects of sampling ratio on classification accuracy. We consider 10 popular sampling methods and evaluate their performance over a range of ratios based on 20 datasets. The results of the numerical experiments suggest that the optimal sampling ratio is between 0.7 and 0.8, although the exact ratio varies depending on the dataset. Furthermore, we find that while factors such as the original imbalance ratio or the number of features do not play a discernible role in determining the optimal ratio, the number of samples in the dataset may have a tangible…
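To make the idea of partial resampling concrete, here is a minimal sketch of random oversampling to a target ratio rather than to full balance. It assumes the sampling ratio means the minority-to-majority count after resampling; the excerpt above does not define the term, so treat that reading as an assumption. The function name `oversample_to_ratio` and the toy data are hypothetical and are not taken from the paper, which evaluates 10 sampling methods not reproduced here.

```python
import random
from collections import Counter


def oversample_to_ratio(X, y, ratio, minority_label, seed=0):
    """Randomly duplicate minority rows until n_minority ~= ratio * n_majority.

    Assumption: 'ratio' is the minority/majority count after resampling,
    so ratio=1.0 is full balance and ratio<1.0 is partial resampling.
    """
    if not 0 < ratio <= 1:
        raise ValueError("ratio must be in (0, 1]")
    rng = random.Random(seed)

    minority_idx = [i for i, label in enumerate(y) if label == minority_label]
    if not minority_idx:
        raise ValueError("no samples with the given minority label")
    n_min = len(minority_idx)
    n_maj = len(y) - n_min

    # Number of minority rows we want after resampling.
    target = round(ratio * n_maj)
    # Never remove rows: if the data already meets the ratio, add nothing.
    n_extra = max(0, target - n_min)

    # Sample minority rows with replacement and append the copies.
    extra = [rng.choice(minority_idx) for _ in range(n_extra)]
    X_res = list(X) + [X[i] for i in extra]
    y_res = list(y) + [y[i] for i in extra]
    return X_res, y_res


# Toy data (hypothetical): 100 majority rows, 10 minority rows.
X = [[float(i)] for i in range(110)]
y = [0] * 100 + [1] * 10

X_res, y_res = oversample_to_ratio(X, y, ratio=0.75, minority_label=1)
counts = Counter(y_res)
print(counts[0], counts[1])
```

Sweeping `ratio` over a grid and scoring a classifier on held-out, unresampled data at each value is one way to look for a dataset-specific optimum of the kind the abstract describes.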
Taxonomy
Topics: Imbalanced Data Classification Techniques · Electricity Theft Detection Techniques · Advanced Statistical Process Monitoring
