Generating Realistic Tabular Data with Large Language Models
Dang Nguyen, Sunil Gupta, Kien Do, Thin Nguyen, Svetha Venkatesh

TL;DR
This paper introduces a novel LLM-based method for generating realistic and diverse synthetic tabular data that accurately captures feature-target correlations, outperforming existing methods in downstream predictive tasks.
Contribution
The paper presents three key improvements (a permutation strategy, feature-conditional sampling, and prompt-based label generation) that enhance LLMs' ability to generate high-quality tabular data.
Findings
Outperforms 10 SOTA baselines on 20 datasets
Synthetic data enables classifiers to match original data performance
Produces highly realistic and diverse synthetic samples
Abstract
While most generative models have achieved success in image data generation, few have been developed for tabular data generation. Recently, due to the success of large language models (LLMs) in diverse tasks, they have also been used for tabular data generation. However, these methods do not capture the correct correlation between the features and the target variable, hindering their application in downstream predictive tasks. To address this problem, we propose an LLM-based method with three important improvements to correctly capture the ground-truth feature-class correlation in the real data. First, we propose a novel permutation strategy for the input data in the fine-tuning phase. Second, we propose a feature-conditional sampling approach to generate synthetic samples. Finally, we generate the labels by constructing prompts based on the generated samples to query our fine-tuned LLM. Our extensive…
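The abstract names the three steps but does not give their details, so the following is only a rough sketch of how such a pipeline might prepare its text inputs. Everything in it is an assumption rather than the authors' implementation: the "feature is value" sentence encoding, the choice to shuffle the features while keeping the target last, and the helper names `encode_row`, `conditional_start`, and `label_prompt` are all hypothetical. No LLM is called; the functions only build the strings that a fine-tuned model would be trained on or prompted with.

```python
import random


def encode_row(row, target, rng):
    """Serialize one table row as text for fine-tuning.

    Assumption: features are shuffled, but the target is kept last so the
    model always sees the label after all feature values.
    """
    features = [(name, value) for name, value in row.items() if name != target]
    rng.shuffle(features)
    parts = [f"{name} is {value}" for name, value in features]
    parts.append(f"{target} is {row[target]}")
    return ", ".join(parts)


def conditional_start(feature, value):
    """Build a start prompt that conditions generation on one feature value."""
    return f"{feature} is {value},"


def label_prompt(sample, target):
    """Build a prompt from generated feature values that asks for the label."""
    parts = [f"{name} is {value}" for name, value in sample.items() if name != target]
    return ", ".join(parts) + f", {target} is"


if __name__ == "__main__":
    rng = random.Random(0)
    # Hypothetical row; the column names are illustrative only.
    row = {"age": 39, "education": "Bachelors", "income": ">50K"}
    print(encode_row(row, "income", rng))
    print(conditional_start("education", "Bachelors"))
    print(label_prompt({"age": 39, "education": "Bachelors"}, "income"))
```

In this sketch the label prompt ends with an open "`income is`", so the model's continuation would be read as the predicted class for the generated sample.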
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling · Text and Document Classification Technologies
