Synthetic Tabular Data Generation for Class Imbalance and Fairness: A Comparative Study
Emmanouil Panagiotou, Arjun Roy, Eirini Ntoutsi

TL;DR
This paper compares different synthetic tabular data generation methods and sampling strategies to address class and group imbalances, aiming to improve fairness and utility in machine learning models.
Contribution
It provides a comprehensive comparison of state-of-the-art generative models and sampling techniques for mitigating class and group imbalances in tabular data.
Findings
Generative models effectively reduce bias in imbalanced datasets
Sampling strategies improve fairness without sacrificing accuracy
Experimental results on four datasets validate the approaches
Abstract
Due to their data-driven nature, Machine Learning (ML) models are susceptible to bias inherited from data, especially in classification problems where class and group imbalances are prevalent. Class imbalance (in the classification target) and group imbalance (in protected attributes like sex or race) can undermine both ML utility and fairness. Although class and group imbalances commonly coincide in real-world tabular datasets, limited methods address this scenario. While most methods use oversampling techniques, like interpolation, to mitigate imbalances, recent advancements in synthetic tabular data generation offer promise but have not been adequately explored for this purpose. To this end, this paper conducts a comparative analysis to address class and group imbalances using state-of-the-art models for synthetic tabular data generation and various sampling strategies. Experimental…
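To make the setting in the abstract concrete, the sketch below counts rows per (class, protected group) subgroup and then balances the subgroups with a simple interpolation-based oversampler. It is an illustrative assumption, not the paper's method: the function names, the toy columns (`age`, `hours`, `sex`, `y`), and the choice to interpolate between two random rows of the same subgroup (a simplification of SMOTE-style nearest-neighbour interpolation) are all hypothetical, and it handles numeric features only.

```python
import random
from collections import Counter, defaultdict


def subgroup_counts(rows, label_key, group_key):
    """Count rows per (class label, protected group) pair."""
    return Counter((r[label_key], r[group_key]) for r in rows)


def interpolate_oversample(rows, label_key, group_key, feature_keys, seed=0):
    """Grow every (label, group) subgroup to the size of the largest one.

    Each new row is a convex combination of two random rows drawn from the
    same subgroup. This is a simplification: SMOTE interpolates toward
    nearest neighbours rather than toward arbitrary subgroup members.
    """
    rng = random.Random(seed)
    buckets = defaultdict(list)
    for r in rows:
        buckets[(r[label_key], r[group_key])].append(r)

    target = max(len(bucket) for bucket in buckets.values())
    out = list(rows)
    for (label, group), bucket in buckets.items():
        for _ in range(target - len(bucket)):
            a, b = rng.choice(bucket), rng.choice(bucket)
            t = rng.random()  # interpolation weight in [0, 1)
            new_row = {k: a[k] + t * (b[k] - a[k]) for k in feature_keys}
            new_row[label_key] = label
            new_row[group_key] = group
            out.append(new_row)
    return out


# Hypothetical toy data with both class imbalance (y) and group imbalance (sex).
rows = (
    [{"age": 30 + i, "hours": 40.0, "sex": "M", "y": 1} for i in range(6)]
    + [{"age": 25 + i, "hours": 35.0, "sex": "M", "y": 0} for i in range(4)]
    + [{"age": 28 + i, "hours": 38.0, "sex": "F", "y": 0} for i in range(3)]
    + [{"age": 33 + i, "hours": 42.0, "sex": "F", "y": 1} for i in range(2)]
)

balanced = interpolate_oversample(rows, "y", "sex", ["age", "hours"])
print(subgroup_counts(rows, "y", "sex"))
print(subgroup_counts(balanced, "y", "sex"))
```

Balancing on the joint (class, group) subgroups rather than on the class label alone is one possible sampling strategy. Balancing only the class, or only the protected group, would leave the other imbalance in place.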
Taxonomy
Topics: Imbalanced Data Classification Techniques · Financial Distress and Bankruptcy Prediction
