Restoring balance: principled under/oversampling of data for optimal classification
Emanuele Loffredo, Mauro Pastore, Simona Cocco, Rémi Monasson

TL;DR
This paper provides a theoretical framework for understanding how under- and oversampling strategies affect the generalization performance of linear classifiers in imbalanced data scenarios, supported by empirical validation.
Contribution
It derives analytical expressions for generalization curves in high-dimensional settings and predicts the impact of sampling strategies based on data statistics and class imbalance.
Findings
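Mixed strategies that combine undersampling and oversampling improve performance.
Theoretical predictions match the results of numerical experiments.
The effect of resampling depends on the class imbalance, the first and second moments of the data, and the performance metric considered.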
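To make the notion of a mixed strategy concrete, here is a minimal sketch that shrinks the majority class and grows the minority class toward a common size. It is an illustration under stated assumptions, not the authors' procedure: the function names `resample_class` and `mixed_resample`, the `target` size, and the choice of oversampling by random duplication are all hypothetical.

```python
import random


def resample_class(items, target, rng):
    """Return `target` examples drawn from `items`.

    Undersampling (target <= len(items)): draw without replacement.
    Oversampling (target > len(items)): keep every original example and
    add random duplicates (an assumed, simple form of oversampling).
    """
    if target <= len(items):
        return rng.sample(items, target)
    extra = rng.choices(items, k=target - len(items))
    return list(items) + extra


def mixed_resample(majority, minority, target, seed=0):
    """Bring both classes to `target` examples each (hypothetical helper)."""
    rng = random.Random(seed)
    return (
        resample_class(majority, target, rng),
        resample_class(minority, target, rng),
    )


# Toy data: 90 majority examples and 10 minority examples (integer IDs).
majority = list(range(90))
minority = list(range(100, 110))

# Meet in the middle at 30 examples per class.
maj_resampled, min_resampled = mixed_resample(majority, minority, target=30)
print(len(maj_resampled), len(min_resampled))
```

Choosing `target` between the two class sizes is what makes the strategy mixed: setting it to the minority size would be pure undersampling, and setting it to the majority size would be pure oversampling.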
Abstract
Class imbalance in real-world data poses a common bottleneck for machine learning tasks, since achieving good generalization on under-represented examples is often challenging. Mitigation strategies, such as under or oversampling the data depending on their abundances, are routinely proposed and tested empirically, but how they should adapt to the data statistics remains poorly understood. In this work, we determine exact analytical expressions of the generalization curves in the high-dimensional regime for linear classifiers (Support Vector Machines). We also provide a sharp prediction of the effects of under/oversampling strategies depending on class imbalance, first and second moments of the data, and the metrics of performance considered. We show that mixed strategies involving under and oversampling of data lead to performance improvement. Through numerical experiments, we show the…
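The abstract notes that the effect of resampling depends on the performance metric. As a hedged illustration of why the metric matters on imbalanced data, the sketch below compares plain accuracy with balanced accuracy (the mean of per-class recalls) for a trivial classifier that always predicts the majority class. Balanced accuracy is chosen here only as an example; this summary does not state which metrics the paper analyzes, and the labels and data are made up.

```python
def accuracy(y_true, y_pred):
    """Fraction of examples whose predicted label matches the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recalls, so each class counts equally."""
    recalls = []
    for label in sorted(set(y_true)):
        idx = [i for i, t in enumerate(y_true) if t == label]
        recalls.append(sum(y_pred[i] == label for i in idx) / len(idx))
    return sum(recalls) / len(recalls)


# Toy imbalanced test set: 90 positives, 10 negatives.
y_true = [1] * 90 + [-1] * 10
# A degenerate classifier that always predicts the majority class.
y_pred = [1] * 100

acc = accuracy(y_true, y_pred)
bal_acc = balanced_accuracy(y_true, y_pred)
print(acc, bal_acc)
```

The majority-only classifier looks strong under plain accuracy but is at chance level under balanced accuracy, which is one reason the best resampling ratio can differ from one metric to another.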
Taxonomy
Topics: Statistical Methods and Inference · Reservoir Engineering and Simulation Methods · Water resources management and optimization
