CTAB-GAN: Effective Table Data Synthesizing
Zilong Zhao, Aditya Kunar, Hiek Van der Scheer, Robert Birke, Lydia Y. Chen

TL;DR
CTAB-GAN is a novel conditional GAN architecture designed to generate high-quality synthetic tabular data with mixed data types, addressing data imbalance and distribution issues to improve data utility and privacy compliance.
Contribution
The paper introduces CTAB-GAN, a new conditional GAN model that effectively models diverse tabular data types and handles data imbalance and skewness.
Findings
Synthetic data closely resembles real data across variable types.
Machine learning models trained on the synthetic data achieve higher accuracy, by up to 17%.
Outperforms existing GAN-based tabular data synthesizers.
Abstract
While data sharing is crucial for knowledge development, privacy concerns and strict regulation (e.g., European General Data Protection Regulation (GDPR)) unfortunately limit its full effectiveness. Synthetic tabular data emerges as an alternative to enable data sharing while fulfilling regulatory and privacy constraints. The state-of-the-art tabular data synthesizers draw methodologies from Generative Adversarial Networks (GANs) and address two main data types in the industry, i.e., continuous and categorical. In this paper, we develop CTAB-GAN, a novel conditional table GAN architecture that can effectively model diverse data types, including a mix of continuous and categorical variables. Moreover, we address data imbalance and long-tail issues, i.e., certain variables have drastic frequency differences across large values. To achieve those aims, we first introduce the information loss…
Taxonomy
Topics: Generative Adversarial Networks and Image Synthesis · Machine Learning in Healthcare · Privacy-Preserving Technologies in Data
