Beyond Imbalance Ratio: Data Characteristics as Critical Moderators of Oversampling Method Selection
Yuwen Jiang, Songyun Ye

TL;DR
This study challenges the IR-threshold paradigm: in controlled experiments, data characteristics such as class separability influenced oversampling effectiveness more than imbalance ratio (IR) did.
Contribution
The paper introduces a 'Context Matters' framework that bases oversampling method selection on multiple data characteristics rather than on IR alone.
Findings
Once confounding data characteristics are controlled for, IR shows a weak to moderate negative correlation with oversampling benefits.
Class separability is a stronger predictor of oversampling effectiveness than IR.
The proposed framework combines IR, class separability, and cluster structure for informed decision-making.
Abstract
The prevailing IR-threshold paradigm posits a positive correlation between imbalance ratio (IR) and oversampling effectiveness, yet this assumption remains empirically unsubstantiated through controlled experimentation. We conducted 12 controlled experiments (N > 100 dataset variants) that systematically manipulated IR while holding data characteristics (class separability, cluster structure) constant via algorithmic generation of Gaussian mixture datasets. Two additional validation experiments examined ceiling effects and metric-dependence. All methods were evaluated on 17 real-world datasets from OpenML. Upon controlling for confounding variables, IR exhibited a weak to moderate negative correlation with oversampling benefits. Class separability emerged as a substantially stronger moderator, accounting for significantly more variance in method effectiveness than IR alone. We propose a…
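The controlled design described in the abstract (vary IR while keeping class separability fixed) can be illustrated with a minimal sketch. This is not the authors' generator: the function names `make_dataset`, `imbalance_ratio`, and `fisher_ratio`, the two-dimensional unit-variance Gaussians, and the use of Fisher's discriminant ratio as a separability proxy are all assumptions made for illustration.

```python
import random
from statistics import mean, pvariance


def make_dataset(n_majority, ir, separation, seed=0):
    """Hypothetical generator: two 2-D unit-variance Gaussian classes.

    `ir` sets the majority/minority size ratio; `separation` is the
    distance between class means along the first feature.
    """
    rng = random.Random(seed)
    n_minority = max(2, round(n_majority / ir))
    majority = [(rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0))
                for _ in range(n_majority)]
    minority = [(rng.gauss(separation, 1.0), rng.gauss(0.0, 1.0))
                for _ in range(n_minority)]
    return majority, minority


def imbalance_ratio(majority, minority):
    """IR as majority count divided by minority count."""
    return len(majority) / len(minority)


def fisher_ratio(majority, minority, axis=0):
    """Fisher's discriminant ratio on one feature (assumed separability proxy)."""
    a = [p[axis] for p in majority]
    b = [p[axis] for p in minority]
    return (mean(a) - mean(b)) ** 2 / (pvariance(a) + pvariance(b))


if __name__ == "__main__":
    # Sweep IR while the generating separation stays fixed.
    for ir in (2, 10, 50):
        maj, mino = make_dataset(n_majority=1000, ir=ir, separation=2.0)
        print(ir, imbalance_ratio(maj, mino), round(fisher_ratio(maj, mino), 2))
```

Because the class means and variances are set independently of the class sizes, changing `ir` alters only the minority sample count; the measured separability then fluctuates with sampling noise alone, which is the kind of isolation the controlled experiments rely on.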
