Synthetic Augmentation in Imbalanced Learning: When It Helps, When It Hurts, and How Much to Add
Zhengchi Ma, Anru R. Zhang

TL;DR
This paper develops a statistical framework for understanding when synthetic augmentation helps in imbalanced learning. It shows that augmentation can sometimes harm performance and proposes a validation-based method for selecting the synthetic sample size.
Contribution
It introduces a unified theory explaining when synthetic augmentation is beneficial and how to choose the optimal number of synthetic samples based on generator accuracy and data imbalance.
Findings
Synthetic augmentation is not always beneficial and can degrade performance.
Optimal synthetic size depends on generator accuracy and alignment with data shift.
Validation-Tuned Synthetic Size (VTSS) effectively guides synthetic sample selection.
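The idea of tuning the synthetic size on validation data can be illustrated with a minimal sketch. This is not the authors' VTSS procedure, whose details are not given in this summary; it is a plain grid search under stated assumptions. The function names (`select_synthetic_size`, `fit_threshold`, `balanced_error`), the one-dimensional Gaussian toy data, the threshold classifier, the balanced-error metric, and the generator bias of 0.5 are all hypothetical choices made for illustration.

```python
import random
from statistics import mean


def balanced_error(threshold, val):
    """Average of the per-class error rates for the rule 'predict 1 if x > threshold'."""
    errs = {0: [], 1: []}
    for x, y in val:
        pred = 1 if x > threshold else 0
        errs[y].append(int(pred != y))
    return 0.5 * (mean(errs[0]) + mean(errs[1]))


def fit_threshold(data):
    """Pick the threshold that minimises unweighted training error (sensitive to imbalance)."""
    candidates = sorted(x for x, _ in data)

    def train_errors(t):
        return sum((1 if x > t else 0) != y for x, y in data)

    return min(candidates, key=train_errors)


def select_synthetic_size(train, val, generate, fit, score, candidate_sizes):
    """Grid search: fit on train plus m synthetic minority points, keep the m with lowest validation loss.

    Returns (best_size, scores), where scores maps each candidate size to its validation loss.
    Ties go to the candidate listed first.
    """
    scores = {}
    for m in candidate_sizes:
        synthetic = [(x, 1) for x in generate(m)]  # label 1 = minority class (assumption)
        model = fit(train + synthetic)
        scores[m] = score(model, val)
    best = min(scores, key=scores.get)
    return best, scores


if __name__ == "__main__":
    rng = random.Random(0)
    # Toy imbalanced data: majority ~ N(0, 1), minority ~ N(2, 1).
    train = [(rng.gauss(0, 1), 0) for _ in range(200)] + [(rng.gauss(2, 1), 1) for _ in range(10)]
    val = [(rng.gauss(0, 1), 0) for _ in range(100)] + [(rng.gauss(2, 1), 1) for _ in range(100)]

    # Hypothetical imperfect generator: minority-like samples with a mean shifted by 0.5.
    def generate(m):
        return [rng.gauss(2.5, 1) for _ in range(m)]

    best, scores = select_synthetic_size(
        train, val, generate, fit_threshold, balanced_error, [0, 50, 100, 200, 400]
    )
    for m, s in scores.items():
        print(f"synthetic size {m:4d}: validation balanced error {s:.3f}")
    print("selected synthetic size:", best)
```

Including 0 in the candidate grid matters in this sketch: it lets the search fall back to no augmentation when the validation loss suggests that synthetic samples do not help, consistent with the finding that augmentation can degrade performance.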
Abstract
Imbalanced classification often causes standard training procedures to prioritize the majority class and perform poorly on rare but important cases. A classic and widely used remedy is to augment the minority class with synthetic samples, but two basic questions remain under-resolved: when does synthetic augmentation actually help, and how many synthetic samples should be generated? We develop a unified statistical framework for synthetic augmentation in imbalanced learning, studying models trained on imbalanced data augmented with synthetic minority samples. Our theory shows that synthetic data is not always beneficial. In a "local symmetry" regime, imbalance is not the dominant source of error, so adding synthetic samples cannot improve learning rates and can even degrade performance by amplifying generator mismatch. When augmentation can help ("local asymmetry"), the optimal…
Taxonomy
Topics: Imbalanced Data Classification Techniques · Machine Learning and Algorithms · Adversarial Robustness in Machine Learning
