Concentration and excess risk bounds for imbalanced classification with synthetic oversampling

Touqeer Ahmad; Mohammadreza M. Kalan; François Portier; Gilles Stupfler

arXiv:2510.20472 · stat.ML · October 24, 2025

Open Access

TL;DR

This paper develops a theoretical framework for analyzing classifiers trained on data produced by synthetic oversampling methods such as SMOTE in imbalanced classification, and derives risk bounds that inform parameter tuning.

Contribution

The paper gives a theoretical analysis of how SMOTE affects classifier risk, including a uniform concentration bound and excess risk guarantees for kernel-based classifiers.

Findings

01

Derived uniform concentration bounds for synthetic data risk

02

Provided nonparametric excess risk guarantees for kernel classifiers

03

Offered practical guidelines for parameter tuning of SMOTE

Abstract

Synthetic oversampling of minority examples using SMOTE and its variants is a leading strategy for addressing imbalanced classification problems. Despite the success of this approach in practice, its theoretical foundations remain underexplored. We develop a theoretical framework to analyze the behavior of SMOTE and related methods when classifiers are trained on synthetic data. We first derive a uniform concentration bound on the discrepancy between the empirical risk over synthetic minority samples and the population risk on the true minority distribution. We then provide a nonparametric excess risk guarantee for kernel-based classifiers trained using such synthetic data. These results lead to practical guidelines for better parameter tuning of both SMOTE and the downstream learning algorithm. Numerical experiments are provided to illustrate and support the theoretical findings.
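For readers unfamiliar with the sampling step the abstract refers to, the following is a minimal sketch of the standard SMOTE interpolation rule: pick a minority point, pick one of its k nearest minority neighbors, and draw a point uniformly on the segment between them. The function name `smote_sketch` and its parameters are hypothetical and chosen for illustration only; this is not the authors' code, and the paper may analyze variants that differ in detail.

```python
import numpy as np


def smote_sketch(X_min, n_synthetic, k=5, rng=None):
    """Illustrative SMOTE-style sampler (hypothetical helper, not from the paper).

    X_min       : array of shape (n, d) holding the minority-class points
    n_synthetic : number of synthetic points to generate
    k           : number of nearest minority neighbors considered per point
    rng         : seed or numpy Generator, for reproducibility
    """
    rng = np.random.default_rng(rng)
    X_min = np.asarray(X_min, dtype=float)
    n = X_min.shape[0]
    if n < 2:
        raise ValueError("need at least two minority points to interpolate")
    k = min(k, n - 1)  # a point cannot have more than n - 1 neighbors

    # Pairwise squared Euclidean distances between minority points.
    diffs = X_min[:, None, :] - X_min[None, :, :]
    d2 = np.sum(diffs ** 2, axis=-1)
    np.fill_diagonal(d2, np.inf)  # exclude each point from its own neighbors

    # Indices of the k nearest minority neighbors of every minority point.
    neighbors = np.argsort(d2, axis=1)[:, :k]

    # Draw a base point, one of its neighbors, and an interpolation weight.
    base = rng.integers(0, n, size=n_synthetic)
    picked = neighbors[base, rng.integers(0, k, size=n_synthetic)]
    lam = rng.random((n_synthetic, 1))  # uniform on [0, 1)

    # Each synthetic point lies on the segment between base and neighbor.
    return X_min[base] + lam * (X_min[picked] - X_min[base])
```

In this sketch, `k` and `n_synthetic` are the quantities a user would tune on the SMOTE side. Because every synthetic point is a convex combination of two observed minority points, the synthetic sample is not an i.i.d. draw from the true minority distribution, which is why a bound relating the synthetic empirical risk to the true minority risk is needed.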

Taxonomy

Topics: Imbalanced Data Classification Techniques · Statistical Methods and Inference · Machine Learning and Algorithms