Diffusion-nested Auto-Regressive Synthesis of Heterogeneous Tabular Data
Hengrui Zhang, Liancheng Fang, Qitian Wu, Philip S. Yu

TL;DR
This paper introduces TabDAR, a novel model combining diffusion and autoregressive techniques with masked transformers to generate heterogeneous tabular data flexibly, handling continuous and discrete features and arbitrary column orders.
Contribution
The paper proposes a new diffusion-nested autoregressive model for tabular data that addresses heterogeneity and permutation invariance, enabling flexible and high-quality data synthesis.
Findings
Outperforms previous methods by 18-45% on multiple metrics
Handles both continuous and discrete data effectively
Supports unconditional and conditional sampling
Abstract
Autoregressive models are predominant in natural language generation, while their application to tabular data remains underexplored. We posit that this can be attributed to two factors: 1) tabular data contains heterogeneous data types, while autoregressive models are primarily designed to model discrete-valued data; 2) tabular data is column permutation-invariant, requiring a generative model to generate columns in arbitrary order. This paper proposes a Diffusion-nested Autoregressive model (TabDAR) to address these issues. To enable autoregressive methods for continuous columns, TabDAR employs a diffusion model to parameterize the conditional distribution of continuous features. To ensure arbitrary generation order, TabDAR resorts to masked transformers with bi-directional attention, which simulate various permutations of column order, hence enabling it to learn the conditional…
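The link between masking and arbitrary column order described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the function name `random_order_masks` and the mask layout are assumptions made for illustration only. It shows how one random permutation of the columns defines, for each generation step, which columns are visible context and which column is predicted next.

```python
import numpy as np


def random_order_masks(num_columns, rng):
    """Return a random column order and the context mask for each step.

    Hypothetical helper for illustration, not taken from the paper.
    masks[t, j] is True when column j is already known (visible context)
    at the step that predicts column order[t].
    """
    order = rng.permutation(num_columns)
    masks = np.zeros((num_columns, num_columns), dtype=bool)
    for t in range(1, num_columns):
        # Carry over the previous context, then reveal the column
        # that was predicted at the previous step.
        masks[t] = masks[t - 1]
        masks[t, order[t - 1]] = True
    return order, masks


rng = np.random.default_rng(0)
order, masks = random_order_masks(4, rng)
for t, target in enumerate(order):
    context = np.flatnonzero(masks[t]).tolist()
    print(f"step {t}: predict column {target} given columns {context}")
```

Drawing a fresh permutation for every training example exposes a model to many different conditioning sets, which is one way to read the abstract's statement that masking simulates various permutations of column order.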
Peer Reviews
Decision: Submitted to ICLR 2025
- This paper tackles an interesting and important problem: how to learn the generative distribution of tabular data. Tabular data is indeed challenging because of the potentially unstructured and diverse nature of the columns at hand. As the authors mention, there has been a lot of recent interest in developing deep generative models for this domain.
- The authors' proposed solution is an interesting one. They provide comparisons in the experiments against several other methods in the liter
- Nitpick: The authors sometimes use overly promotional language, e.g. line 71 (“through two ingenious design features”) and line 84 (“TABDAR offers several unparalleled advantages”). I recommend against the use of overly promotional words such as “ingenious” and “unparalleled advantages” in the paper.
- Nitpick: Eq. (5): Define $x^{<i}$. For a reader familiar with autoregressive modeling notation, it is obvious what this means, but for others it may not be. E.g. it would be good to clarify what
- Good results, and extensive experimental results.
- The TabSyn results stated in the paper are different from (worse than) those reported in the TabSyn paper. These differences need to be fully explained and accounted for before the paper can be considered for publication.
- In addition, since TabSyn and TABDAR perform similarly and address the same problem, the advantages and differences between the two approaches should be discussed.
- As a method that simply integrates diffusion modeling into bidirectional transformers to handle continuo
The authors present state-of-the-art empirical results (by some margin). The proposed method is based on interesting observations and to the best of my knowledge, the way they combine multi-modal signals is novel.
- The authors claim that columns are permutation invariant, which I tend to agree with. However, combining this observation with autoregressive generation is impossible. From the manuscript, I couldn't figure out whether the method combines a lower triangular mask with permutations of the tokens. Can the authors clarify?
- Superscripts and subscripts: it seems like the authors used mixed notations in different sections (I think?). E.g., in equation (1), items in a sequence are subscript and then $<i$ is
Taxonomy
Topics: Text and Document Classification Technologies
Methods: Diffusion
