Language Models are Realistic Tabular Data Generators

Vadim Borisov; Kathrin Seßler; Tobias Leemann; Martin Pawelczyk; Gjergji Kasneci

arXiv:2210.06280 · cs.LG · April 25, 2023 · 45 cites

TL;DR

This paper introduces GReaT (Generation of Realistic Tabular data), which uses an auto-regressive generative LLM to sample realistic synthetic tabular data and can condition on any subset of features; it reports outperforming previous methods.

Contribution

The paper presents GReaT, a novel LLM-based approach for tabular data generation that effectively models data distributions and allows flexible conditioning on features.

Findings

01. GReaT achieves state-of-the-art performance on multiple datasets.

02. The generated data maintains high validity and quality.

03. GReaT effectively models heterogeneous feature types.

Abstract

Tabular data is among the oldest and most ubiquitous forms of data. However, the generation of synthetic samples with the original data's characteristics remains a significant challenge for tabular data. While many generative models from the computer vision domain, such as variational autoencoders or generative adversarial networks, have been adapted for tabular data generation, less research has been directed towards recent transformer-based large language models (LLMs), which are also generative in nature. To this end, we propose GReaT (Generation of Realistic Tabular data), which exploits an auto-regressive generative LLM to sample synthetic and yet highly realistic tabular data. Furthermore, GReaT can model tabular data distributions by conditioning on any subset of features; the remaining features are sampled without additional overhead. We demonstrate the effectiveness of the…
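
The abstract's idea of handing table rows to a language model, and of conditioning on a subset of features, can be illustrated with a small serialization sketch. This is a minimal illustration rather than the authors' implementation: the "column is value" clause format, the shuffled feature order, and the helper names `encode_row`, `decode_row`, and `conditioning_prompt` are all assumptions made for this example, and no language model is actually invoked.

```python
import random


def encode_row(row, rng=None):
    """Serialize one table row as 'column is value' clauses.

    Assumed encoding for illustration only. If an rng is given, the
    feature order is shuffled so that no column is tied to a position.
    """
    items = list(row.items())
    if rng is not None:
        rng.shuffle(items)
    return ", ".join(f"{col} is {val}" for col, val in items)


def decode_row(text):
    """Parse 'column is value' clauses back into a dict of strings.

    Clauses that do not contain ' is ' are skipped, which is one simple
    way to discard malformed generated text.
    """
    row = {}
    for clause in text.split(", "):
        col, sep, val = clause.partition(" is ")
        if sep:
            row[col.strip()] = val.strip()
    return row


def conditioning_prompt(known):
    """Build a text prefix from the known features.

    An auto-regressive model would be asked to continue this prefix,
    filling in the remaining features.
    """
    return encode_row(known) + ", "


if __name__ == "__main__":
    row = {"age": 39, "education": "Bachelors", "income": "<=50K"}
    text = encode_row(row, random.Random(0))
    print(text)
    print(decode_row(text))
    print(repr(conditioning_prompt({"age": 39})))
```

Under these assumptions, conditioning needs no extra machinery: the known features become the prompt and the model's continuation is parsed with `decode_row`. The sketch does not handle values that themselves contain `", "` or `" is "`; a real encoder would need escaping or a stricter format.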

Code & Models

Repositories

kathrinse/be_great (PyTorch, official)

Videos

Language Models are Realistic Tabular Data Generators · SlidesLive

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Multimodal Machine Learning Applications