TabDiff: a Mixed-type Diffusion Model for Tabular Data Generation

Juntong Shi; Minkai Xu; Harper Hua; Hengrui Zhang; Stefano Ermon; Jure Leskovec

arXiv:2410.20626·cs.LG·February 18, 2025·2 cites

TL;DR

TabDiff is a novel diffusion-based generative model designed for high-quality tabular data synthesis, effectively handling mixed data types and complex inter-column relationships through a joint continuous-time diffusion process and transformer architecture.

Contribution

The paper introduces a unified diffusion framework for mixed-type tabular data, with feature-wise learnable diffusion processes and a mixed-type sampler that improve generation quality.

Findings

01

Achieves up to a 22.5% improvement over the state-of-the-art model on pair-wise column correlation estimation.

02

Achieves better average performance than existing methods across all evaluated metrics on seven datasets.

03

Efficient end-to-end training with a transformer-based parameterization.

Abstract

Synthesizing high-quality tabular data is an important topic in many data science tasks, ranging from dataset augmentation to privacy protection. However, developing expressive generative models for tabular data is challenging due to its inherent heterogeneous data types, complex inter-correlations, and intricate column-wise distributions. In this paper, we introduce TabDiff, a joint diffusion framework that models all mixed-type distributions of tabular data in one model. Our key innovation is the development of a joint continuous-time diffusion process for numerical and categorical data, where we propose feature-wise learnable diffusion processes to counter the high disparity of different feature distributions. TabDiff is parameterized by a transformer handling different input types, and the entire framework can be efficiently optimized in an end-to-end fashion. We further introduce a…
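The joint forward process described in the abstract can be illustrated with a minimal sketch: numerical cells receive Gaussian noise, categorical cells are replaced by a mask token, and each feature has its own schedule parameter. This is not the authors' implementation. The schedule shapes (`t ** rho` and `1 - t ** k`), the names `sigma`, `keep_prob`, `noise_row`, and `MASK`, and the example values are all hypothetical, and the per-feature parameters that the paper learns are fixed by hand here.

```python
import random

MASK = "<mask>"  # hypothetical mask token for categorical cells


def sigma(t, rho, sigma_max=1.0):
    """Noise level of a numerical feature at time t in [0, 1].

    rho is a per-feature shape parameter (assumed form, fixed by hand here).
    """
    return sigma_max * t ** rho


def keep_prob(t, k):
    """Probability that a categorical feature is still unmasked at time t.

    k is a per-feature shape parameter (assumed form, fixed by hand here).
    """
    return 1.0 - t ** k


def noise_row(nums, cats, t, rhos, ks, rng):
    """Noise one table row at time t: Gaussian for numbers, masking for categories."""
    noisy_nums = [x + sigma(t, rho) * rng.gauss(0.0, 1.0)
                  for x, rho in zip(nums, rhos)]
    noisy_cats = [c if rng.random() < keep_prob(t, k) else MASK
                  for c, k in zip(cats, ks)]
    return noisy_nums, noisy_cats


rng = random.Random(0)
row_nums = [0.5, -1.2]               # two numerical columns
row_cats = ["married", "private"]    # two categorical columns
rhos, ks = [1.0, 3.0], [1.0, 2.0]    # one schedule parameter per column

clean = noise_row(row_nums, row_cats, 0.0, rhos, ks, rng)         # t = 0: row unchanged
halfway = noise_row(row_nums, row_cats, 0.5, rhos, ks, rng)       # t = 0.5: partly noised
fully_noised = noise_row(row_nums, row_cats, 1.0, rhos, ks, rng)  # t = 1: all categories masked
```

With different `rho` and `k` values, columns are corrupted at different rates at the same time `t`, which is the intuition behind giving each feature its own schedule.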

Peer Reviews

Decision·ICLR 2025 Poster

Reviewer 01·Rating 6·Confidence 3

Strengths

1. The paper is well-written and well-structured. It is self-contained and easy to follow.
2. The problem investigated in this paper is well-motivated and practical.
3. While most of the techniques have been investigated in prior work, this work seamlessly integrates them and adapts them to the tabular data generation task.
4. Extensive experiments sufficiently verify the effectiveness of the proposed approach, and ablation studies further outline the design choices and the benefits of the learnable no

Weaknesses

1. Despite the claim of cross-modal modeling, the proposed method still models numerical and categorical values separately, which might make it difficult to synthesize more complicated tabular datasets.
2. While I generally appreciate the effort and the performance, I am not fully convinced by the claim of the multi-modal diffusion framework. The experiments in Tables 1, 2, and 3 all suggest performance comparable to TabSyn. As TabSyn has been published in ICLR 2024, the auth

Reviewer 02·Rating 6·Confidence 3

Strengths

This manuscript demonstrates advantages in multiple aspects. It provides a clear and comprehensive introduction of the research problem, and conducts extensive experiments on multiple data sets and evaluation criteria. Additionally, the paper elaborates on relevant algorithms and techniques. The algorithmic sections present some innovative and original findings, contributing to advancements in the field of machine learning and offering new insights to the academic community.

Weaknesses

1. Some necessary explanations are missing from the description of the algorithm in the Method section and the Appendix; more detailed descriptions would improve the readability of the paper.
2. The description of the categorical columns needs an example to illustrate the method more clearly.
3. Formula (10) is missing a parenthesis.
4. Why are formula (10) and formula (11) constructed this way? Is there a source? If not, can you explain the idea behind the construction?

Reviewer 03·Rating 6·Confidence 3

Strengths

1. The experiments are extensive, demonstrating the efficiency of the proposed method.
2. Hybrid diffusion models are employed to handle mixed-type data in tabular datasets.

Weaknesses

1. The authors claim that they explicitly tackle the feature-wise heterogeneity issue in the multi-modal diffusion process. However, the details of how they handle this problem are missing.
2. Even though the experiments show the efficiency of the proposed method, the paper lacks a motivation for using masked diffusion for categorical features.

Code & Models

Repositories

minkaixu/tabdiff·PyTorch·Official

Taxonomy

Topics·Image Processing and 3D Reconstruction

Methods·Diffusion