A Note on Statistically Accurate Tabular Data Generation Using Large Language Models

Andrey Sidorenko

arXiv:2505.02659·cs.LG·May 7, 2025

TL;DR

This paper presents a probability-driven prompting method for large language models to generate more statistically accurate synthetic tabular data by better capturing complex feature dependencies.

Contribution

It introduces a novel prompting approach that leverages LLMs to estimate conditional distributions, improving the fidelity of synthetic tabular data.

Findings

1. Enhanced preservation of feature dependencies in generated data
2. Scalable approach for complex categorical variables
3. Improved statistical fidelity of synthetic datasets

Abstract

Large language models (LLMs) have shown promise in synthetic tabular data generation, yet existing methods struggle to preserve complex feature dependencies, particularly among categorical variables. This work introduces a probability-driven prompting approach that leverages LLMs to estimate conditional distributions, enabling more accurate and scalable data synthesis. The results highlight the potential of prompting probability distributions to enhance the statistical fidelity of LLM-generated tabular data.
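The idea in the abstract, asking the model for a conditional distribution rather than for a single value, can be illustrated with a minimal sketch. This is not the paper's implementation: the prompt wording, the JSON reply format, and the names `build_prompt`, `parse_distribution`, `sample_row`, and `fake_llm` are all assumptions made for illustration, and `fake_llm` is a hard-coded stand-in for a real model call.

```python
import json
import random
from typing import Callable, Dict, List


def build_prompt(target: str, categories: List[str], context: Dict[str, str]) -> str:
    """Ask for a distribution over `categories` given the features sampled so far.

    The wording is hypothetical; the paper's actual prompts may differ.
    """
    known = ", ".join(f"{k}={v}" for k, v in context.items()) or "nothing"
    return (
        f"Known features: {known}. Estimate the probability of each value of "
        f"'{target}' among {categories}. Reply with a JSON object mapping "
        f"each value to its probability."
    )


def parse_distribution(reply: str, categories: List[str]) -> Dict[str, float]:
    """Parse a JSON reply, ignore unknown keys, clip negatives, renormalize."""
    raw = json.loads(reply)
    weights = {c: max(float(raw.get(c, 0.0)), 0.0) for c in categories}
    total = sum(weights.values())
    if total <= 0:
        # No usable mass in the reply: fall back to a uniform distribution.
        return {c: 1.0 / len(categories) for c in categories}
    return {c: w / total for c, w in weights.items()}


def sample_row(
    schema: Dict[str, List[str]],
    llm: Callable[[str], str],
    rng: random.Random,
) -> Dict[str, str]:
    """Sample features one at a time, each conditioned on the earlier ones."""
    row: Dict[str, str] = {}
    for target, categories in schema.items():
        dist = parse_distribution(llm(build_prompt(target, categories, row)), categories)
        probs = [dist[c] for c in categories]
        # The draw happens locally, so sampled frequencies follow the
        # stated probabilities rather than the model's decoding behavior.
        row[target] = rng.choices(categories, weights=probs, k=1)[0]
    return row


def fake_llm(prompt: str) -> str:
    """Stand-in for a model call (assumption): returns fixed distributions."""
    if "education=university" in prompt:
        return '{"clerk": 0.2, "engineer": 0.8}'
    if "education=primary" in prompt:
        return '{"clerk": 0.9, "engineer": 0.1}'
    return '{"primary": 0.5, "university": 0.5}'


SCHEMA = {
    "education": ["primary", "university"],
    "occupation": ["clerk", "engineer"],
}

if __name__ == "__main__":
    rng = random.Random(0)
    for _ in range(3):
        print(sample_row(SCHEMA, fake_llm, rng))
```

In this sketch the dependency between `education` and `occupation` is carried by the prompted probabilities, and the categorical draw is done outside the model. Swapping `fake_llm` for a real client would also require handling malformed replies, which `parse_distribution` only partly covers.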

Code & Models

Repositories

mostly-ai/paper-datallm-materials (official)

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling