FASTGEN: Fast and Cost-Effective Synthetic Tabular Data Generation with LLMs
Anh Nguyen, Sam Schafft, Nicholas Hale, John Alfaro

TL;DR
FASTGEN uses an LLM to infer each field's distribution and encode it in a reusable sampling script, so large, realistic synthetic tables can be produced without per-record LLM inference, which reduces generation time and cost.
Contribution
The paper presents a new method that leverages LLMs to create distribution-based scripts for efficient, large-scale synthetic tabular data generation, outperforming traditional direct inference approaches.
Findings
Outperforms traditional methods in data diversity and realism
Reduces time and cost for large-scale data synthesis
Enables scalable synthetic data generation without continuous model inference
Abstract
Synthetic data generation has emerged as an invaluable solution in scenarios where real-world data collection and usage are limited by cost and scarcity. Large language models (LLMs) have demonstrated remarkable capabilities in producing high-fidelity, domain-relevant samples across various fields. However, existing approaches that directly use LLMs to generate each record individually impose prohibitive time and cost burdens, particularly when large volumes of synthetic data are required. In this work, we propose a fast, cost-effective method for realistic tabular data synthesis that leverages LLMs to infer and encode each field's distribution into a reusable sampling script. By automatically classifying fields into numerical, categorical, or free-text types, the LLM generates distribution-based scripts that can efficiently produce diverse, realistic datasets at scale without…
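To make the idea of a reusable sampling script concrete, here is a minimal Python sketch. It is not the authors' implementation. The `FIELD_SPECS` dictionary, the field names, every distribution parameter, and the template-based treatment of free text are hypothetical stand-ins for what an LLM might infer once per schema. The point it illustrates is that, after that one-time step, rows can be drawn locally with no further model calls.

```python
import random

# Hypothetical field specs: illustrative values only, standing in for
# distributions an LLM might infer once for a given table schema.
FIELD_SPECS = {
    "age": {"type": "numerical", "mean": 40.0, "std": 12.0, "min": 18, "max": 90},
    "plan": {
        "type": "categorical",
        "values": ["free", "pro", "enterprise"],
        "weights": [0.70, 0.25, 0.05],
    },
    "note": {
        "type": "free_text",
        "templates": ["Contacted support about {topic}.", "Asked about {topic}."],
        "slots": {"topic": ["billing", "login", "export"]},
    },
}


def sample_field(spec, rng):
    """Draw one value for a field according to its (assumed) spec."""
    kind = spec["type"]
    if kind == "numerical":
        # Gaussian draw, clamped to the allowed range and rounded to an integer.
        value = rng.gauss(spec["mean"], spec["std"])
        return round(min(max(value, spec["min"]), spec["max"]))
    if kind == "categorical":
        # Weighted choice over the listed categories.
        return rng.choices(spec["values"], weights=spec["weights"], k=1)[0]
    if kind == "free_text":
        # Assumption: free text is approximated by filling templates.
        template = rng.choice(spec["templates"])
        fills = {name: rng.choice(opts) for name, opts in spec["slots"].items()}
        return template.format(**fills)
    raise ValueError(f"unknown field type: {kind}")


def generate_rows(specs, n, seed=0):
    """Generate n synthetic rows locally, with no model inference per row."""
    rng = random.Random(seed)
    return [
        {name: sample_field(spec, rng) for name, spec in specs.items()}
        for _ in range(n)
    ]
```

Calling `generate_rows(FIELD_SPECS, 100_000)` only costs local compute, which is the source of the time and cost savings the abstract describes: the LLM is consulted once to produce the script, not once per record.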
