CTSyn: A Foundation Model for Cross Tabular Data Generation
Xiaofeng Lin, Chenheng Xu, Matthew Yang, Guang Cheng

TL;DR
CTSyn is a diffusion-based foundation model designed for generating high-quality synthetic tabular data by consolidating diverse tables into a unified latent space and conditioning on schema information.
Contribution
It introduces a novel autoencoder and diffusion model framework for heterogeneous tabular data generation, outperforming existing methods on standard benchmarks.
Findings
Outperforms existing table synthesizers in utility and diversity
Effective handling of heterogeneous datasets through schema-conditioned reconstruction
Establishes a foundation for large-scale tabular generative models
Abstract
Generative Foundation Models (GFMs) have achieved remarkable success in producing high-quality synthetic data for images and text. However, their application to tabular data presents significant challenges due to the heterogeneous nature of table features. Current cross-table learning frameworks struggle because they lack a generative model backbone and an effective mechanism to decode heterogeneous feature values. To address these challenges, we propose the Cross-Table Synthesizer (CTSyn), a diffusion-based generative foundation model for tabular data generation. CTSyn comprises two key components. The first is an autoencoder network that consolidates diverse tables into a unified latent space. It dynamically reconstructs table values using a table schema embedding, allowing adaptation to heterogeneous datasets. The second is a conditional latent diffusion model that generates samples…
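As a rough illustration of the latent diffusion component described in the abstract, the sketch below applies the standard DDPM forward noising step to a latent vector. It is a minimal sketch under stated assumptions, not the authors' implementation: the linear beta schedule, the step count of 1000, the helper names (`linear_beta_schedule`, `cumulative_alphas`, `noise_latent`) and the toy latent row are all hypothetical, and the schema conditioning and the learned denoiser are omitted.

```python
import math
import random

def linear_beta_schedule(num_steps, beta_start=1e-4, beta_end=0.02):
    """Linearly spaced noise variances (an assumed, commonly used schedule)."""
    step = (beta_end - beta_start) / (num_steps - 1)
    return [beta_start + i * step for i in range(num_steps)]

def cumulative_alphas(betas):
    """alpha_bar_t = product over s <= t of (1 - beta_s)."""
    out, running = [], 1.0
    for beta in betas:
        running *= 1.0 - beta
        out.append(running)
    return out

def noise_latent(z0, t, alpha_bars, rng):
    """Sample z_t from q(z_t | z_0) = N(sqrt(alpha_bar_t) * z_0, (1 - alpha_bar_t) * I)."""
    scale = math.sqrt(alpha_bars[t])
    sigma = math.sqrt(1.0 - alpha_bars[t])
    eps = [rng.gauss(0.0, 1.0) for _ in z0]
    zt = [scale * z + sigma * e for z, e in zip(z0, eps)]
    return zt, eps

rng = random.Random(0)
alpha_bars = cumulative_alphas(linear_beta_schedule(1000))
z0 = [0.5, -1.2, 0.3, 2.0]  # hypothetical latent encoding of one table row
zt, eps = noise_latent(z0, 999, alpha_bars, rng)  # almost pure noise at the last step
```

In a latent diffusion setup of this kind, a denoising network would be trained to predict `eps` from `zt`, the step index and any conditioning signal; generated latents would then be passed through the autoencoder's decoder to recover table values.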
Peer Reviews
Decision: ICLR 2025 Poster
1. While there have been prior works employing auto-encoders and latent diffusion models for tabular data synthesis (e.g. https://arxiv.org/abs/2310.09656), prior works dealing with heterogeneous table synthesis have been limited.
2. The authors provide a good comparison against baselines on a good range of datasets, covering fidelity, privacy and ML utility of generated data.
1. While a common latent space across tables provides a strategy to work on heterogeneous tables, it certainly limits the ability to interpret what the embeddings in the space mean, and the authors have not studied this aspect (to clarify, this is different from the privacy plots).
2. The pre-trained LM to emit embeddings for rows - individually for each column type, while preserving tabular structure - raises questions about scalability to enterprise tables - which have thousands of columns ass
A strength of this work is the successful application of a straightforward approach to map embedding vectors of categorical variables back to their original space. Although simple, this method effectively demonstrates that returning to the original categorical space can be achieved without complex transformations, providing a useful baseline for handling categorical data embeddings.
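The review above does not say how embeddings are mapped back to categories. One simple possibility, shown purely as an assumption, is a nearest-neighbor lookup: pick the category whose embedding is most similar to the generated vector. The helper names (`cosine`, `decode_category`), the toy three-dimensional embeddings and the choice of cosine similarity are all hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def decode_category(vector, category_embeddings):
    """Return the category whose embedding is closest to `vector` (assumed rule)."""
    return max(category_embeddings,
               key=lambda name: cosine(vector, category_embeddings[name]))

# Hypothetical embeddings for the values of a single categorical column.
category_embeddings = {
    "red": [1.0, 0.0, 0.1],
    "green": [0.0, 1.0, 0.1],
    "blue": [0.1, 0.0, 1.0],
}
generated = [0.9, 0.2, 0.0]  # hypothetical decoder output for one cell
decoded = decode_category(generated, category_embeddings)  # "red"
```

A lookup like this needs no learned inverse transformation, which is consistent with the reviewer's point that no complex transformation is required.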
A limitation of this work is that it primarily proposes a method for handling individual variables within a framework similar to LSGM [1] that trains a diffusion model in latent space. As such, the approach lacks substantial novelty and may have limited impact, given that it focuses on variable handling within an established generative model framework rather than introducing fundamentally new techniques. Minors. Typos in line 129 (specific) and 266 (The) [1] Vahdat, Arash, Karsten Kreis, and J
- The paper is well-written and the proposed technique is detailed clearly.
- The results clearly demonstrate the prowess of CTSyn in terms of matching the training data at least as captured by column-wise statistical metrics (Table 2).
- Figure 2 and Table 3 demonstrate the ability of CTSyn to maintain privacy of training data (compared to other non differentially-private synthetic tabular data generators) while still learning useful representations. This demonstrates that the model is capabl
- The training process employing a diffusion model requires costly pre-training and fine-tuning, hence scaling the modeling pipeline to large tables (e.g., 100s of columns, millions of rows) may be challenging.
- One crucial facet of the paper that lacks clarification is a description of the meta-data for the various tables employed.
- Further, as the current model is termed a foundation model for tabular data generation, it is crucial to demonstrate its effectiveness on noisy training
Taxonomy
Topics: Semantic Web and Ontologies
Methods: Attention Is All You Need · Softmax · Linear Layer · Multi-Head Attention · Synthesizer · Diffusion · Latent Diffusion Model
