Synthetic Tabular Data Generation for Imbalanced Classification: The Surprising Effectiveness of an Overlap Class
Annie D'souza, Swetha M, Sunita Sarawagi

TL;DR
This paper introduces a novel pre-processing technique that converts binary class labels into ternary labels by adding an overlap class, significantly improving the quality of synthetic data and classifier accuracy in imbalanced tabular datasets.
Contribution
The paper proposes a new method of handling class imbalance by introducing an overlap class, enhancing the performance of deep generative models and classifiers on tabular data.
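A minimal sketch of what a binary-to-ternary relabeling step could look like. The summary above does not say how the overlap class is identified, so the rule used here is purely an illustrative assumption: a majority example is moved to the overlap class when most of its k nearest neighbours are minority examples. The function name add_overlap_class, the parameters k and threshold, and the label encoding (0 = majority, 1 = minority, 2 = overlap) are all hypothetical and are not taken from the paper.

```python
import numpy as np

def add_overlap_class(X, y, k=5, threshold=0.5):
    """Turn binary labels (0 = majority, 1 = minority) into ternary labels.

    ASSUMED rule, not the paper's: a majority point becomes class 2 (overlap)
    when at least `threshold` of its k nearest neighbours are minority points.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y).astype(int)
    # Pairwise squared Euclidean distances between all rows of X.
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
    np.fill_diagonal(d2, np.inf)  # a point is not its own neighbour
    nn = np.argsort(d2, axis=1)[:, :k]  # indices of the k nearest neighbours
    minority_frac = (y[nn] == 1).mean(axis=1)
    y3 = y.copy()
    y3[(y == 0) & (minority_frac >= threshold)] = 2
    return y3

# Toy data: one majority point (5.0) sits inside the minority cluster.
X_demo = np.array([[0.0], [0.1], [0.2], [0.3], [5.0], [5.1], [5.2], [4.9]])
y_demo = np.array([0, 0, 0, 0, 0, 1, 1, 1])
y_ternary = add_overlap_class(X_demo, y_demo, k=2)
```

A generative model or classifier would then be trained on the ternary labels in place of the binary ones. The brute-force distance matrix is only practical for small datasets.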
Findings
Improved quality of synthetic minority class data.
Enhanced classifier accuracy on imbalanced datasets.
Effective across multiple datasets and models.
Abstract
Handling imbalance in class distribution when building a classifier over tabular data has been a problem of long-standing interest. One popular approach is augmenting the training dataset with synthetically generated data. While classical augmentation techniques were limited to linear interpolation of existing minority class examples, higher-capacity deep generative models have recently shown greater promise. However, handling imbalance in class distribution when building a deep generative model is also a challenging problem that has not been studied as extensively as imbalanced classifier training. We show that state-of-the-art deep generative models yield significantly lower-quality minority examples than majority examples. …
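For context, the classical linear-interpolation augmentation mentioned in the abstract can be sketched as follows. This is a SMOTE-style illustration written for this summary, not code from the paper; the function name interpolate_minority and its parameters are hypothetical, and it assumes purely numeric features and at least two minority examples.

```python
import numpy as np

def interpolate_minority(X_min, n_new, k=3, seed=0):
    """Create n_new synthetic points, each on the line segment between a
    minority example and one of its k nearest minority neighbours."""
    rng = np.random.default_rng(seed)
    X_min = np.asarray(X_min, dtype=float)
    n = len(X_min)
    # Pairwise squared distances among minority examples only.
    d2 = ((X_min[:, None, :] - X_min[None, :, :]) ** 2).sum(axis=2)
    np.fill_diagonal(d2, np.inf)  # exclude each point from its own neighbours
    nn = np.argsort(d2, axis=1)[:, :min(k, n - 1)]
    base = rng.integers(0, n, size=n_new)            # random base examples
    pick = rng.integers(0, nn.shape[1], size=n_new)  # random neighbour slot
    neighbour = nn[base, pick]
    lam = rng.random((n_new, 1))                     # interpolation weight in [0, 1)
    return X_min[base] + lam * (X_min[neighbour] - X_min[base])

X_min_demo = np.array([[1.0, 1.0], [1.5, 0.8], [0.9, 1.4], [1.2, 1.1]])
synthetic = interpolate_minority(X_min_demo, n_new=10)
```

Because every synthetic point is a convex combination of two existing minority examples, this kind of augmentation cannot leave the region those examples already span, which is the limitation the abstract contrasts with deep generative models.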
Taxonomy
Topics: Medical Coding and Health Information
Methods: Diffusion
