Data Balancing Strategies: A Systematic Survey of Resampling and Augmentation Methods
Behnam Yousefimehr, Mehdi Ghatee, Javad Fazli, Shervin Ghaffari, Zahra Rafei, Mohammad Amin Seifi, Sajed Tavakoli, Abolfazl Nikahd, Mahdi Razi Gandomani, Alireza Orouji, Ramtin Mahmoudi Kashani, Sarina Heshmati, Negin Sadat Mousavi

TL;DR
This systematic survey reviews a wide range of data balancing methods in machine learning, analyzing their assumptions, mechanisms, and suitability for various data challenges, and highlights future research directions.
Contribution
It provides a comprehensive categorization and critical analysis of existing resampling and augmentation techniques, including advanced generative models and hybrid strategies, for imbalanced datasets.
Findings
No single method is universally best; effectiveness depends on dataset and task.
Advanced generative models like GANs and diffusion models are promising for oversampling.
Guidelines and future directions are proposed for practitioners and researchers.
Abstract
Imbalanced datasets, where one class significantly outnumbers others, remain a persistent challenge in machine learning, often biasing predictions toward the majority class and degrading classifier performance. This paper provides a comprehensive, systematic review of data balancing methods, extending beyond foundational oversampling techniques such as the Synthetic Minority Oversampling Technique (SMOTE) and its variants (e.g., Borderline SMOTE, K-Means SMOTE, and Safe-Level SMOTE) to encompass advanced adaptive methods (MWMOTE, AMDO), deep generative models (generative adversarial networks, variational autoencoders, and diffusion models), undersampling techniques (NearMiss, Tomek Links), combination/hybrid methods (SMOTE-ENN, SMOTE-Tomek, and SMOTE+OCSVM), ensemble strategies (SMOTEBoost, RUSBoost, Balanced Random Forest, and One-Sided Selection), and specialized approaches for…
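As an illustration of the foundational oversampling idea named in the abstract, the following is a minimal sketch of SMOTE-style interpolation in plain Python. It is not taken from the paper: the function name smote_sketch, its parameters, and the toy data are assumptions made for illustration, and it omits the refinements of the variants listed above.

```python
import math
import random


def smote_sketch(minority, n_new, k=3, seed=0):
    """Create n_new synthetic points, each lying on the segment between a
    randomly chosen minority sample and one of its k nearest minority
    neighbours. Hypothetical helper, for illustration only."""
    if len(minority) < 2:
        raise ValueError("need at least two minority samples")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        base = minority[i]
        # Rank the other minority samples by Euclidean distance to the base.
        others = [j for j in range(len(minority)) if j != i]
        others.sort(key=lambda j: math.dist(base, minority[j]))
        neighbour = minority[rng.choice(others[:k])]
        # Interpolate: gap in [0, 1) picks a point between base and neighbour.
        gap = rng.random()
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return synthetic


# Toy minority class with two features (assumed data).
minority = [[1.0, 1.0], [1.2, 0.9], [0.8, 1.1], [1.1, 1.3]]
new_points = smote_sketch(minority, n_new=6)
```

Because every synthetic point is a convex combination of two existing minority samples, the new points stay inside the region already occupied by the minority class. This is also why plain interpolation can amplify noise near class boundaries, which the borderline and safe-level variants try to address.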
