Mixed-Type Tabular Data Synthesis with Score-based Diffusion in Latent Space
Hengrui Zhang, Jiani Zhang, Balasubramaniam Srinivasan, Zhengyuan Shen, Xiao Qin, Christos Faloutsos, Huzefa Rangwala, George Karypis

TL;DR
Tabsyn trains a diffusion model inside a VAE latent space to synthesize tabular data with mixed data types and complex inter-column relations, aiming for high sample quality and fast generation.
Contribution
The paper presents Tabsyn, a new method that effectively synthesizes tabular data by combining diffusion models with a VAE in a unified latent space, handling various data types and relations.
Findings
Outperforms existing methods on six datasets
Reduces distribution error rates by up to 86%
Generates data faster with fewer reverse steps
Abstract
Recent advances in tabular data generation have greatly enhanced synthetic data quality. However, extending diffusion models to tabular data is challenging because tabular data combine intricately varied distributions with a mix of data types. This paper introduces Tabsyn, a methodology that synthesizes tabular data by running a diffusion model within a latent space crafted by a variational autoencoder (VAE). The key advantages of the proposed Tabsyn include (1) Generality: the ability to handle a broad spectrum of data types by converting them into a single unified space and to explicitly capture inter-column relations; (2) Quality: optimizing the distribution of latent embeddings to enhance the subsequent training of diffusion models, which helps generate high-quality synthetic data; (3) Speed: far fewer reverse steps and faster synthesis than existing diffusion-based…
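The idea of converting mixed column types into one unified space and then diffusing in that space can be illustrated with a small sketch. This is not the authors' implementation: the per-column linear map for numerical values, the embedding-table lookup for categorical values, the flattening step that stands in for a trained VAE encoder, and the fixed noise level `sigma` are all assumptions made for illustration, as are the names `tokenize_row` and `perturb`.

```python
import numpy as np

rng = np.random.default_rng(0)

def tokenize_row(num_values, cat_indices, num_weight, num_bias, cat_tables):
    """Map one mixed-type row to an (n_columns, d) array of column embeddings.

    Assumed scheme: each numerical value scales a per-column weight vector
    and adds a per-column bias; each categorical value is a table lookup.
    """
    num_tokens = num_values[:, None] * num_weight + num_bias        # (n_num, d)
    cat_tokens = np.stack(
        [table[i] for table, i in zip(cat_tables, cat_indices)]     # (n_cat, d)
    )
    return np.concatenate([num_tokens, cat_tokens], axis=0)

def perturb(z0, sigma, rng):
    """Forward noising for a score-based model: z_t = z_0 + sigma * eps."""
    eps = rng.standard_normal(z0.shape)
    return z0 + sigma * eps, eps

# Hypothetical table: 2 numerical columns, 2 categorical columns (3 and 5 categories).
d = 4
n_num, cat_sizes = 2, [3, 5]
num_weight = rng.standard_normal((n_num, d))
num_bias = rng.standard_normal((n_num, d))
cat_tables = [rng.standard_normal((k, d)) for k in cat_sizes]

row_num = np.array([0.5, -1.2])   # numerical column values
row_cat = [2, 0]                  # category indices

tokens = tokenize_row(row_num, row_cat, num_weight, num_bias, cat_tables)

# Stand-in for a trained VAE encoder: flatten the tokens into one latent vector.
z0 = tokens.reshape(-1)

# Noised latent and the noise that a denoising network would learn to predict.
zt, eps = perturb(z0, sigma=0.8, rng=rng)
```

Once every column lives in the same continuous space, a single Gaussian noising process applies to the whole row, so the diffusion model does not need a separate treatment per data type.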
Taxonomy
Topics: Cancer-related molecular mechanisms research · Generative Adversarial Networks and Image Synthesis · Machine Learning in Healthcare
Methods: Diffusion · SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings
