Generalized Oversampling for Learning from Imbalanced datasets and Associated Theory
Samuel Stocksieker, Denys Pommeret, Arthur Charpentier

TL;DR
This paper introduces GOLIATH, a generalized oversampling method based on kernel density estimates, designed to improve learning from imbalanced datasets in both classification and regression tasks, with strong empirical results.
Contribution
It presents a novel data augmentation algorithm that unifies and extends existing oversampling techniques for imbalanced learning, including regression.
Findings
GOLIATH significantly outperforms existing methods in imbalanced regression.
The approach provides explicit formulas for synthetic data generation.
Empirical results demonstrate improved accuracy and robustness.
Abstract
In supervised learning, real-world datasets are frequently imbalanced, which makes learning difficult for standard algorithms. Research and solutions in imbalanced learning have mainly focused on classification tasks. Despite its importance, very few solutions exist for imbalanced regression. In this paper, we propose a data augmentation procedure, the GOLIATH algorithm, based on kernel density estimates, which can be used in both classification and regression. This general approach encompasses two large families of synthetic oversampling: those based on perturbations, such as Gaussian Noise, and those based on interpolations, such as SMOTE. It also provides an explicit form of these machine learning algorithms and an expression of their conditional densities, in particular for SMOTE. New synthetic data generators are deduced. We apply GOLIATH in…
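To make the two oversampling families named in the abstract concrete, here is a minimal NumPy sketch of one generator from each: a Gaussian-noise perturbation and a SMOTE-style interpolation. It is not the GOLIATH algorithm itself. The function names, the isotropic `bandwidth`, the neighbour count `k`, and the toy data are all assumptions made for illustration.

```python
import numpy as np

def gaussian_noise_oversample(X, n_new, bandwidth, rng):
    """Perturbation family (illustrative): pick seed points uniformly at
    random and add isotropic Gaussian noise. This amounts to sampling from
    a Gaussian kernel density estimate with the given bandwidth."""
    seeds = X[rng.integers(0, len(X), size=n_new)]
    return seeds + rng.normal(scale=bandwidth, size=seeds.shape)

def smote_like_oversample(X, n_new, k, rng):
    """Interpolation family (illustrative): pick a seed, pick one of its
    k nearest neighbours, and draw a point uniformly on the segment
    joining them. Requires k < len(X)."""
    # Pairwise squared Euclidean distances; a point is not its own neighbour.
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
    np.fill_diagonal(d2, np.inf)
    neighbours = np.argsort(d2, axis=1)[:, :k]

    seed_idx = rng.integers(0, len(X), size=n_new)
    nb_idx = neighbours[seed_idx, rng.integers(0, k, size=n_new)]
    lam = rng.uniform(size=(n_new, 1))  # interpolation weight in [0, 1)
    return X[seed_idx] + lam * (X[nb_idx] - X[seed_idx])

# Toy usage: 20 hypothetical minority samples in 2 dimensions.
rng = np.random.default_rng(0)
X_min = rng.normal(size=(20, 2))
X_noise = gaussian_noise_oversample(X_min, n_new=50, bandwidth=0.1, rng=rng)
X_interp = smote_like_oversample(X_min, n_new=50, k=5, rng=rng)
```

The two generators behave differently at the edges of the data: interpolated points always stay inside the convex hull of the original samples, whereas perturbed points can fall outside it.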
Taxonomy
Topics: Imbalanced Data Classification Techniques · Artificial Intelligence in Healthcare · Electricity Theft Detection Techniques
Methods: Synthetic Minority Over-sampling Technique
