Diffusion Models for Tabular Data Imputation and Synthetic Data Generation

Mario Villaizán-Vallelado; Matteo Salvatori; Carlos Segura; Ioannis Arapakis

arXiv:2407.02549 · cs.LG · June 10, 2025

TL;DR

This paper introduces a novel diffusion model with transformer-based conditioning and dynamic masking for effective tabular data imputation and synthetic data generation, outperforming existing methods in efficiency, similarity, and privacy.

Contribution

It presents a new diffusion model architecture with attention, transformer, and masking enhancements tailored for tabular data tasks, unifying imputation and generation.

Findings

01. Outperforms state-of-the-art models like VAE, GAN, and existing diffusion models.

02. Effectively handles missing data imputation with high efficiency.

03. Generates statistically similar data with reduced privacy risks.

Abstract

Data imputation and data generation have important applications for many domains, like healthcare and finance, where incomplete or missing data can hinder accurate analysis and decision-making. Diffusion models have emerged as powerful generative models capable of capturing complex data distributions across various data modalities such as image, audio, and time series data. Recently, they have also been adapted to generate tabular data. In this paper, we propose a diffusion model for tabular data that introduces three key enhancements: (1) a conditioning attention mechanism, (2) an encoder-decoder transformer as the denoising network, and (3) dynamic masking. The conditioning attention mechanism is designed to improve the model's ability to capture the relationship between the condition and synthetic data. The transformer layers help model interactions within the condition (encoder) or…
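To make the third enhancement more concrete, here is a minimal sketch of what dynamic masking could look like during training. It is an illustration only, not the authors' implementation: the helper names `sample_dynamic_mask` and `split_row`, the example row, and the choice to draw the number of masked features uniformly from 1 to the number of columns are all assumptions, since the abstract above does not spell out these details.

```python
import random


def sample_dynamic_mask(num_features, rng):
    """Return one boolean per feature; True marks a feature to be generated."""
    # Assumption: the number of masked features is drawn uniformly
    # from 1..num_features (both ends inclusive).
    num_masked = rng.randint(1, num_features)
    masked_idx = set(rng.sample(range(num_features), num_masked))
    return [i in masked_idx for i in range(num_features)]


def split_row(row, mask):
    """Split a row into (condition, target) dicts keyed by column index."""
    # Unmasked features act as the condition; masked ones are the
    # values a denoising network would have to reconstruct.
    condition = {i: v for i, (v, m) in enumerate(zip(row, mask)) if not m}
    target = {i: v for i, (v, m) in enumerate(zip(row, mask)) if m}
    return condition, target


rng = random.Random(0)
row = [35, "engineer", 72000.0, "married"]  # hypothetical example row

# A fresh mask is drawn at every (simulated) training step, so the same
# row is seen with different condition/target splits over time.
for step in range(3):
    mask = sample_dynamic_mask(len(row), rng)
    condition, target = split_row(row, mask)
    print(step, mask, condition, target)
```

Under these assumptions, a mask that covers only some columns corresponds to imputation, while a mask that covers every column corresponds to generating a full synthetic row, which is one way a single model could serve both tasks.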

Peer Reviews

Decision · Submitted to ICLR 2024

Reviewer 01 · Rating 6 (marginally above the acceptance threshold) · Confidence 5

Strengths

Overall: The paper is easy to read and the contribution is simple but effective. The experiments cover a wide range of datasets, though not of algorithms. Pros: (i) The paper extends TabDDPM to TabGenDDPM using the transformer architecture, which has been wildly successful in other generative settings. The experiments confirm the benefits of the proposed approach. The additional benefit of covering both imputation and generation in the same framework enables a wide range of use cases in real-worl

Weaknesses

Cons: (a) Some competing methods, such as AIM, CTAB-GAN+ and others, are not compared in the paper. (b) The datasets have few features: HELOC has the most, with only 21, so it is unclear how this framework performs when the feature set is large.

Reviewer 02 · Rating 5 (marginally below the acceptance threshold) · Confidence 5

Strengths

The experimental comparisons are good. The authors evaluate TabGenDDPM on eight datasets under three evaluation criteria.

Weaknesses

1. The overall contribution of this paper is limited. All of the content except the transformer conditioning architecture is already known. The architecture design is heuristic and comes with no theoretical performance guarantees. Moreover, the authors build upon the Variance Preserving (VP) SDE (e.g., DDPM, or TabDDPM for tabular data). They do not mention whether their method works for the Variance Exploding (VE) SDE (e.g., score-based generative models, or STaSy [1] for tabular data). [1]: Kim, J., Lee,

Reviewer 03 · Rating 6 (marginally above the acceptance threshold) · Confidence 4

Strengths

- The proposed architecture is a natural improvement on TabDDPM and, according to the experiments, it does seem to improve the model in terms of ML efficacy
- The paper is clear and well written, with several illustrations
- The privacy risk is considered

Weaknesses

- The proposed architecture is mostly a derivative work of TabDDPM
- The proposed diffusion algorithms are a bit outdated now, especially on the discrete side, given works like Austin et al., "Structured Denoising Diffusion Models in Discrete State-Spaces", NeurIPS 2021, or Campbell et al., "A Continuous Time Framework for Discrete Denoising Models", NeurIPS 2022. It is worth noting that "mask" systems are also studied in (Austin et al. 2021).
- No ablation study to validate the separately differe

Taxonomy

Methods: Softmax · Attention Is All You Need · Diffusion