Data Balancing Strategies: A Systematic Survey of Resampling and Augmentation Methods

Behnam Yousefimehr; Mehdi Ghatee; Javad Fazli; Shervin Ghaffari; Zahra Rafei; Mohammad Amin Seifi; Sajed Tavakoli; Abolfazl Nikahd; Mahdi Razi Gandomani; Alireza Orouji; Ramtin Mahmoudi Kashani; Sarina Heshmati; Negin Sadat Mousavi

arXiv:2505.13518·stat.ML·April 30, 2026

TL;DR

This systematic survey reviews a wide range of data balancing methods in machine learning, analyzing their assumptions, mechanisms, and suitability for various data challenges, and highlights future research directions.

Contribution

It provides a comprehensive categorization and critical analysis of existing resampling and augmentation techniques, including advanced generative models and hybrid strategies, for imbalanced datasets.

Findings

01. No single method is universally best; effectiveness depends on dataset and task.

02. Advanced generative models like GANs and diffusion models are promising for oversampling.

03. Guidelines and future directions are proposed for practitioners and researchers.

Abstract

Imbalanced datasets, where one class significantly outnumbers others, remain a persistent challenge in machine learning, often biasing predictions toward the majority class and degrading classifier performance. This paper provides a comprehensive, systematic review of data balancing methods, extending beyond foundational oversampling techniques such as the Synthetic Minority Oversampling Technique (SMOTE) and its variants (e.g., Borderline SMOTE, K-Means SMOTE, and Safe-Level SMOTE) to encompass advanced adaptive methods (MWMOTE, AMDO), deep generative models (generative adversarial networks, variational autoencoders, and diffusion models), undersampling techniques (NearMiss, Tomek Links), combination/hybrid methods (SMOTE-ENN, SMOTE-Tomek, and SMOTE+OCSVM), ensemble strategies (SMOTEBoost, RUSBoost, Balanced Random Forest, and One-Sided Selection), and specialized approaches for…
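To make the foundational oversampling idea concrete, here is a minimal sketch of SMOTE-style interpolation, written from the commonly described form of the technique rather than taken from the survey itself. The function name `smote_sketch`, its parameters, and the brute-force neighbor search are illustrative assumptions; a real project would more likely use an established library implementation.

```python
import random


def smote_sketch(minority, n_new, k=5, seed=0):
    """Create n_new synthetic minority points by SMOTE-style interpolation.

    Hypothetical helper for illustration: each synthetic point lies on the
    line segment between a random minority sample and one of its k nearest
    minority-class neighbors.
    """
    rng = random.Random(seed)
    n = len(minority)
    if n < 2:
        raise ValueError("need at least two minority samples")
    k = min(k, n - 1)  # cannot have more neighbors than other samples

    def sq_dist(a, b):
        # Squared Euclidean distance; enough for ranking neighbors.
        return sum((x - y) ** 2 for x, y in zip(a, b))

    # Brute-force k nearest neighbors within the minority class (excluding self).
    neighbors = []
    for i, p in enumerate(minority):
        others = sorted(
            (j for j in range(n) if j != i),
            key=lambda j: sq_dist(p, minority[j]),
        )
        neighbors.append(others[:k])

    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(n)          # pick a base minority sample
        j = rng.choice(neighbors[i])  # pick one of its nearest neighbors
        gap = rng.random()            # interpolation weight in [0, 1)
        synthetic.append(
            [a + gap * (b - a) for a, b in zip(minority[i], minority[j])]
        )
    return synthetic


if __name__ == "__main__":
    pts = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5]]
    for point in smote_sketch(pts, n_new=3, k=2):
        print(point)
```

Because every synthetic point is a convex combination of two existing minority samples, the output stays inside the region the minority class already occupies. Variants named in the abstract, such as Borderline SMOTE, mainly change which base samples and neighbors are eligible for interpolation.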
