Concentration and excess risk bounds for imbalanced classification with synthetic oversampling
Touqeer Ahmad, Mohammadreza M. Kalan, François Portier, Gilles Stupfler

TL;DR
This paper develops a theoretical framework to analyze the behavior and risk bounds of synthetic oversampling methods like SMOTE in imbalanced classification, providing insights for better parameter tuning.
Contribution
It provides a theoretical analysis of how SMOTE-style synthetic oversampling affects classifier risk, including a uniform concentration bound and a nonparametric excess risk guarantee for kernel-based classifiers.
Findings
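Derives a uniform concentration bound on the gap between the empirical risk over synthetic minority samples and the population risk on the true minority distribution
Provides a nonparametric excess risk guarantee for kernel-based classifiers trained on synthetic data
Offers practical guidelines for tuning the parameters of both SMOTE and the downstream learning algorithm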
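As a rough illustration of the first finding, the quantity being controlled can be written schematically as below. The notation is an assumption made here for readability and is not taken from the paper: $\mathcal{G}$ is a class of classifiers, $\ell$ a loss function, $Z_1^*, \dots, Z_m^*$ the synthetic minority samples, and $(X, Y)$ a draw from the true distribution with $Y = 1$ marking the minority class.

```latex
% Schematic only: the paper's exact statement, constants, and rates are not reproduced here.
\sup_{g \in \mathcal{G}}
\left|
  \frac{1}{m} \sum_{i=1}^{m} \ell\bigl(g(Z_i^*), 1\bigr)
  \;-\;
  \mathbb{E}\bigl[\, \ell\bigl(g(X), 1\bigr) \,\big|\, Y = 1 \,\bigr]
\right|
```

A uniform concentration bound states that this supremum is small with high probability, so the risk measured on synthetic samples stays close to the true minority-class risk for every classifier in the class at once.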
Abstract
Synthetic oversampling of minority examples using SMOTE and its variants is a leading strategy for addressing imbalanced classification problems. Despite the success of this approach in practice, its theoretical foundations remain underexplored. We develop a theoretical framework to analyze the behavior of SMOTE and related methods when classifiers are trained on synthetic data. We first derive a uniform concentration bound on the discrepancy between the empirical risk over synthetic minority samples and the population risk on the true minority distribution. We then provide a nonparametric excess risk guarantee for kernel-based classifiers trained using such synthetic data. These results lead to practical guidelines for better parameter tuning of both SMOTE and the downstream learning algorithm. Numerical experiments are provided to illustrate and support the theoretical findings.
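For readers unfamiliar with the procedure the abstract refers to, the following is a minimal sketch of SMOTE-style oversampling using only NumPy. It is an illustration of the generic technique, not the authors' code: the function name `smote_sketch` and its parameters (`n_synthetic`, `k`, `seed`) are hypothetical. Each synthetic point is drawn uniformly on the segment between a randomly chosen minority point and one of its `k` nearest minority neighbors.

```python
import numpy as np


def smote_sketch(X_min, n_synthetic, k=5, seed=0):
    """Illustrative SMOTE-style sampler (hypothetical helper, not from the paper).

    X_min: array of shape (n, d) holding the minority-class points.
    Returns an array of shape (n_synthetic, d) of synthetic points.
    """
    rng = np.random.default_rng(seed)
    X_min = np.asarray(X_min, dtype=float)
    n = X_min.shape[0]
    if n < 2:
        raise ValueError("need at least two minority points to interpolate")
    k = min(k, n - 1)  # a point cannot have more than n - 1 neighbors

    # Pairwise squared Euclidean distances between minority points.
    diffs = X_min[:, None, :] - X_min[None, :, :]
    d2 = np.sum(diffs ** 2, axis=-1)
    np.fill_diagonal(d2, np.inf)  # exclude each point from its own neighbor list

    # Indices of the k nearest minority neighbors of every point, shape (n, k).
    neighbors = np.argsort(d2, axis=1)[:, :k]

    # Pick a base point and one of its neighbors for each synthetic sample.
    base = rng.integers(0, n, size=n_synthetic)
    pick = rng.integers(0, k, size=n_synthetic)
    nbr = neighbors[base, pick]

    # Interpolate with a uniform weight in [0, 1).
    u = rng.random((n_synthetic, 1))
    return X_min[base] + u * (X_min[nbr] - X_min[base])


if __name__ == "__main__":
    rng = np.random.default_rng(1)
    X_min = rng.normal(size=(20, 2))
    Z = smote_sketch(X_min, n_synthetic=50, k=3)
    print(Z.shape)
```

Because every synthetic point is a convex combination of two minority points, the synthetic sample never leaves the convex hull of the observed minority data. The number of neighbors `k` is the main SMOTE parameter exposed in this sketch.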
Taxonomy
Topics: Imbalanced Data Classification Techniques · Statistical Methods and Inference · Machine Learning and Algorithms
