Mixed-Type Tabular Data Synthesis with Score-based Diffusion in Latent Space

Hengrui Zhang; Jiani Zhang; Balasubramaniam Srinivasan; Zhengyuan Shen; Xiao Qin; Christos Faloutsos; Huzefa Rangwala; George Karypis

arXiv:2310.09656·cs.LG·May 14, 2024·23 cites

TL;DR

Tabsyn introduces a novel diffusion-based approach within a VAE latent space to generate high-quality, fast, and versatile synthetic tabular data across diverse data types and complex inter-column relations.

Contribution

The paper presents Tabsyn, a new method that effectively synthesizes tabular data by combining diffusion models with a VAE in a unified latent space, handling various data types and relations.

Findings

01 Outperforms existing methods on six datasets

02 Reduces distribution error rates by up to 86%

03 Generates data faster with fewer reverse steps

Abstract

Recent advances in tabular data generation have greatly enhanced synthetic data quality. However, extending diffusion models to tabular data is challenging because tabular data combine intricately varied distributions with a blend of data types. This paper introduces Tabsyn, a methodology that synthesizes tabular data by running a diffusion model in the latent space of a variational autoencoder (VAE). The key advantages of the proposed Tabsyn include (1) Generality: the ability to handle a broad spectrum of data types by converting them into a single unified space and to explicitly capture inter-column relations; (2) Quality: optimizing the distribution of latent embeddings to enhance the subsequent training of diffusion models, which helps generate high-quality synthetic data; (3) Speed: far fewer reverse steps and faster synthesis than existing diffusion-based…
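The "single unified space" idea in point (1) can be illustrated with a minimal sketch. This is not the paper's or the repository's implementation: the function name `tokenize_row`, the embedding width `d`, and the randomly initialized weights are all hypothetical stand-ins for parameters that a real model would learn. The sketch only shows how numerical and categorical columns can both be mapped to same-sized vectors, one per column, which an encoder could then process jointly.

```python
import numpy as np

rng = np.random.default_rng(0)

def tokenize_row(num_values, cat_indices, num_weight, num_bias, cat_tables):
    """Map one mixed-type row to an (n_columns, d) array of column embeddings.

    All parameters are illustrative; in a trained model they would be learned.
    """
    # Numerical column j: the scalar value scales a per-column vector, plus a bias.
    num_tokens = num_values[:, None] * num_weight + num_bias      # (n_num, d)
    # Categorical column k: look up the category's row in that column's table.
    cat_tokens = np.stack(
        [table[idx] for table, idx in zip(cat_tables, cat_indices)]
    )                                                             # (n_cat, d)
    # Both kinds of column now live in the same d-dimensional space.
    return np.concatenate([num_tokens, cat_tokens], axis=0)

d = 4                                         # hypothetical embedding width
num_weight = rng.normal(size=(2, d))          # two numerical columns
num_bias = rng.normal(size=(2, d))
cat_tables = [rng.normal(size=(3, d)),        # categorical column with 3 levels
              rng.normal(size=(5, d))]        # categorical column with 5 levels

row_tokens = tokenize_row(np.array([0.5, -1.2]), [2, 0],
                          num_weight, num_bias, cat_tables)
print(row_tokens.shape)
```

In a latent-diffusion setup of the kind the abstract describes, embeddings like these would be passed through a VAE encoder, and the diffusion model would be trained on the resulting latent vectors rather than on the raw columns.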

Code & Models

Repositories

amazon-science/tabsyn (PyTorch, official)

Videos

Mixed-Type Tabular Data Synthesis with Score-based Diffusion in Latent Space · SlidesLive

Taxonomy

Topics: Cancer-related molecular mechanisms research · Generative Adversarial Networks and Image Synthesis · Machine Learning in Healthcare

Methods: Diffusion · SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings