Language Models are Realistic Tabular Data Generators
Vadim Borisov, Kathrin Seßler, Tobias Leemann, Martin Pawelczyk, Gjergji Kasneci

TL;DR
This paper introduces GReaT, which uses an auto-regressive transformer language model to generate realistic synthetic tabular data. It can condition on any subset of features when sampling and is reported to outperform previous methods.
Contribution
The paper presents GReaT, a novel LLM-based approach for tabular data generation that effectively models data distributions and allows flexible conditioning on features.
Findings
GReaT achieves state-of-the-art performance on multiple datasets.
The generated data maintains high validity and quality.
GReaT effectively models heterogeneous feature types.
Abstract
Tabular data is among the oldest and most ubiquitous forms of data. However, the generation of synthetic samples with the original data's characteristics remains a significant challenge for tabular data. While many generative models from the computer vision domain, such as variational autoencoders or generative adversarial networks, have been adapted for tabular data generation, less research has been directed towards recent transformer-based large language models (LLMs), which are also generative in nature. To this end, we propose GReaT (Generation of Realistic Tabular data), which exploits an auto-regressive generative LLM to sample synthetic and yet highly realistic tabular data. Furthermore, GReaT can model tabular data distributions by conditioning on any subset of features; the remaining features are sampled without additional overhead. We demonstrate the effectiveness of the…
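To make the idea of conditioning on a feature subset concrete, here is a minimal sketch of how a table row might be turned into text for an auto-regressive language model and parsed back afterwards. It assumes a "name is value" sentence format with shuffled feature order, which is my reading of the approach and not taken from the text above. The helper names (encode_row, conditioning_prompt, decode_row) and the example columns are hypothetical, and no actual language model is called.

```python
import random

def encode_row(row, rng):
    """Serialize one table row as text, with the features in random order.

    Shuffling the order is what would let a model trained on such text
    continue from any subset of features (assumed format, see lead-in).
    """
    items = list(row.items())
    rng.shuffle(items)
    return ", ".join(f"{name} is {value}" for name, value in items)

def conditioning_prompt(known):
    """Build a text prefix from the known features.

    A fine-tuned model would be asked to continue this prefix,
    sampling the remaining features.
    """
    return ", ".join(f"{name} is {value}" for name, value in known.items()) + ","

def decode_row(text, columns):
    """Parse generated text back into a row of strings.

    Returns None when the text is not a valid row (unknown column,
    malformed clause, or missing features), so it can be rejected.
    Values containing commas are not handled in this sketch.
    """
    row = {}
    for part in text.split(","):
        part = part.strip()
        if not part:
            continue
        name, sep, value = part.partition(" is ")
        if not sep or name not in columns:
            return None
        row[name] = value.strip()
    return row if set(row) == set(columns) else None

# Hypothetical example row; values come back as strings after decoding.
example = {"age": 39, "education": "Bachelors", "income": "<=50K"}
encoded = encode_row(example, random.Random(0))
decoded = decode_row(encoded, list(example))
prompt = conditioning_prompt({"education": "Bachelors"})
```

Here the prompt fixes education, and a model would be expected to fill in age and income. The decoding step doubles as a validity check, since text that does not parse into a complete row is discarded.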
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Multimodal Machine Learning Applications
