FASTGEN: Fast and Cost-Effective Synthetic Tabular Data Generation with LLMs

Anh Nguyen; Sam Schafft; Nicholas Hale; John Alfaro

arXiv:2507.15839·cs.LG·July 22, 2025

TL;DR

FASTGEN uses an LLM to generate a reusable sampling script from each field's inferred distribution, rather than calling the model once per record. This reduces the time and cost of producing large, realistic synthetic tabular datasets.

Contribution

The paper presents a method in which an LLM classifies each field as numerical, categorical, or free-text and encodes its distribution in a sampling script. The script then produces data at scale without continuous model inference, and the authors report that it outperforms direct per-record LLM inference.

Findings

01 Outperforms traditional methods in data diversity and realism

02 Reduces time and cost for large-scale data synthesis

03 Enables scalable synthetic data generation without continuous model inference

Abstract

Synthetic data generation has emerged as an invaluable solution in scenarios where real-world data collection and usage are limited by cost and scarcity. Large language models (LLMs) have demonstrated remarkable capabilities in producing high-fidelity, domain-relevant samples across various fields. However, existing approaches that directly use LLMs to generate each record individually impose prohibitive time and cost burdens, particularly when large volumes of synthetic data are required. In this work, we propose a fast, cost-effective method for realistic tabular data synthesis that leverages LLMs to infer and encode each field's distribution into a reusable sampling script. By automatically classifying fields into numerical, categorical, or free-text types, the LLM generates distribution-based scripts that can efficiently produce diverse, realistic datasets at scale without…
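The idea of a reusable, distribution-based sampling script can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the field names, the distribution parameters, the template-based handling of free text, and the helpers `sample_field` and `generate_rows` are all hypothetical. In practice the specification would be produced once by the LLM; here it is hard-coded.

```python
import random

# Hypothetical field specification of the kind an LLM might emit once per schema.
# Every field name, distribution, and parameter below is an illustrative assumption.
FIELD_SPECS = {
    "age": {
        "type": "numerical",
        "mean": 40.0, "std": 12.0, "min": 18, "max": 90,
    },
    "plan": {
        "type": "categorical",
        "values": ["free", "pro", "enterprise"],
        "weights": [0.70, 0.25, 0.05],
    },
    "note": {
        "type": "free_text",
        "templates": ["Asked about {topic}.", "Reported an issue with {topic}."],
        "slots": {"topic": ["billing", "login", "exports"]},
    },
}


def sample_field(spec, rng):
    """Draw one value for a field according to its declared type."""
    kind = spec["type"]
    if kind == "numerical":
        # Sample from a normal distribution, then clamp to the allowed range.
        value = rng.gauss(spec["mean"], spec["std"])
        return int(round(min(max(value, spec["min"]), spec["max"])))
    if kind == "categorical":
        # Weighted draw over a fixed set of values.
        return rng.choices(spec["values"], weights=spec["weights"], k=1)[0]
    if kind == "free_text":
        # Assumed approach: fill a randomly chosen template with random slot values.
        template = rng.choice(spec["templates"])
        fills = {slot: rng.choice(options) for slot, options in spec["slots"].items()}
        return template.format(**fills)
    raise ValueError(f"unknown field type: {kind}")


def generate_rows(specs, n, seed=0):
    """Generate n synthetic records locally, with no model calls."""
    rng = random.Random(seed)
    return [
        {name: sample_field(spec, rng) for name, spec in specs.items()}
        for _ in range(n)
    ]


rows = generate_rows(FIELD_SPECS, 1000, seed=1)
```

Once such a script exists, producing more rows only requires a larger `n`, so the number of model calls does not grow with the size of the dataset.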
