Synthetic Tabular Data Generation for Imbalanced Classification: The Surprising Effectiveness of an Overlap Class

Annie D'souza; Swetha M; Sunita Sarawagi

arXiv:2412.15657·cs.LG·February 20, 2025

TL;DR

This paper introduces a novel pre-processing technique that converts binary class labels into ternary labels by adding an overlap class, significantly improving the quality of synthetic data and classifier accuracy in imbalanced tabular datasets.

Contribution

The paper proposes a new method of handling class imbalance by introducing an overlap class, enhancing the performance of deep generative models and classifiers on tabular data.
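The binary-to-ternary relabeling described above can be illustrated with a minimal sketch. The summary does not state how the overlap class is identified, so the rule used here is purely an assumption for illustration: a majority example is moved to the overlap class when at least one of its k nearest neighbours is a minority example. The function name `add_overlap_class`, the parameter `k`, and the label encoding (0 = majority, 1 = minority, 2 = overlap) are hypothetical and are not taken from the paper or its repository.

```python
import numpy as np

def add_overlap_class(X, y, k=5):
    """Relabel binary labels (0 = majority, 1 = minority) as ternary labels.

    ASSUMPTION (not from the paper): a majority point is moved to class 2
    (overlap) if at least one of its k nearest neighbours is a minority point.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)

    # Pairwise squared Euclidean distances between all rows of X.
    diff = X[:, None, :] - X[None, :, :]
    dist = (diff ** 2).sum(axis=-1)
    np.fill_diagonal(dist, np.inf)  # a point is not its own neighbour

    # Indices of the k nearest neighbours of every point.
    nn = np.argsort(dist, axis=1)[:, :k]
    has_minority_neighbour = (y[nn] == 1).any(axis=1)

    # Copy the labels and mark majority points near the minority class.
    y3 = y.copy()
    y3[(y == 0) & has_minority_neighbour] = 2
    return y3

# Toy one-feature dataset: four majority points far from the minority
# cluster, and one majority point sitting right next to it.
X_toy = np.array([[0.0], [0.1], [0.2], [0.3], [9.9], [10.0], [10.1]])
y_toy = np.array([0, 0, 0, 0, 0, 1, 1])
y_ternary = add_overlap_class(X_toy, y_toy, k=2)
print(y_ternary)
```

Under this assumed rule, the ternary labels would then replace the binary ones when training the generative model and the downstream classifier. Any other overlap criterion could be substituted without changing the surrounding pipeline.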

Findings

1. Improved quality of synthetic minority class data.
2. Enhanced classifier accuracy on imbalanced datasets.
3. Effective across multiple datasets and models.

Abstract

Handling imbalance in class distribution when building a classifier over tabular data has been a problem of long-standing interest. One popular approach is augmenting the training dataset with synthetically generated data. While classical augmentation techniques were limited to linear interpolation of existing minority class examples, higher-capacity deep generative models have recently shown greater promise. However, handling imbalance in class distribution when building a deep generative model is also a challenging problem that has not been studied as extensively as imbalanced classifier training. We show that state-of-the-art deep generative models yield significantly lower-quality minority examples than majority examples. …

Code & Models

Repositories

annie2603/ord (PyTorch, official)

Videos

Synthetic Tabular Data Generation for Imbalanced Classification: The Surprising Effectiveness of an Overlap Class (source: underline)

Taxonomy

Topics: Medical Coding and Health Information

Methods: Diffusion