An improved tabular data generator with VAE-GMM integration
Patricia A. Apellániz, Juan Parras, Santiago Zazo

TL;DR
This paper introduces a novel VAE-GMM-based model for generating synthetic tabular data that better captures complex data distributions, outperforming existing models such as the GAN-based CTGAN and the VAE-based TVAE, especially in healthcare applications.
Contribution
The paper presents a VAE-GMM integrated model that effectively handles mixed data types and non-Gaussian distributions, improving synthetic data generation over prior models.
Findings
Outperforms CTGAN and TVAE on real-world datasets
Handles both continuous and discrete features effectively
Provides more accurate data distribution modeling
Abstract
The rising use of machine learning in various fields requires robust methods to create synthetic tabular data. Data should preserve key characteristics while addressing data scarcity challenges. Current approaches based on Generative Adversarial Networks, such as the state-of-the-art CTGAN model, struggle with the complex structures inherent in tabular data. These data often contain both continuous and discrete features with non-Gaussian distributions. Therefore, we propose a novel Variational Autoencoder (VAE)-based model that addresses these limitations. Inspired by the TVAE model, our approach incorporates a Bayesian Gaussian Mixture model (BGM) within the VAE architecture. This avoids the limitations imposed by assuming a strictly Gaussian latent space, allowing for a more accurate representation of the underlying data distribution during data generation. Furthermore, our model…
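The core idea in the abstract, sampling latent points from a fitted Bayesian Gaussian mixture instead of a standard Gaussian prior, can be illustrated with a minimal sketch. This is not the authors' implementation: the latent codes below are synthetic stand-ins for encoder outputs, the decoder is a hypothetical fixed linear map rather than a trained network, and the number of components is an arbitrary choice. Only scikit-learn's `BayesianGaussianMixture` is a real library class.

```python
import numpy as np
from sklearn.mixture import BayesianGaussianMixture

rng = np.random.default_rng(0)

# Assumption: stand-in for encoder outputs. Two well-separated clusters
# in a 2-D latent space, i.e. clearly not a single standard Gaussian.
latent_codes = np.vstack([
    rng.normal(loc=-3.0, scale=0.5, size=(200, 2)),
    rng.normal(loc=3.0, scale=0.5, size=(200, 2)),
])

# Fit a Bayesian Gaussian mixture to the latent codes.
# n_components is an upper bound; unneeded components get small weights.
bgm = BayesianGaussianMixture(n_components=5, max_iter=500, random_state=0)
bgm.fit(latent_codes)

# Generation step: draw latent points from the fitted mixture
# rather than from N(0, I).
z_new, component_ids = bgm.sample(100)

# Assumption: hypothetical decoder, a fixed linear map to 4 output
# features standing in for a trained decoder network.
decoder_weights = rng.normal(size=(2, 4))
synthetic_rows = z_new @ decoder_weights

print(z_new.shape, synthetic_rows.shape)
```

In a full model the decoder would also need output heads suited to continuous and discrete columns; the sketch only shows where the mixture replaces the Gaussian prior at sampling time.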
Taxonomy
Topics: Rough Sets and Fuzzy Logic
