A Note on Statistically Accurate Tabular Data Generation Using Large Language Models
Andrey Sidorenko

TL;DR
This paper presents a probability-driven prompting method for large language models to generate more statistically accurate synthetic tabular data by better capturing complex feature dependencies.
Contribution
It introduces a novel prompting approach that leverages LLMs to estimate conditional distributions, improving the fidelity of synthetic tabular data.
Findings
Enhanced preservation of feature dependencies in generated data
Scalable approach for complex categorical variables
Improved statistical fidelity of synthetic datasets
Abstract
Large language models (LLMs) have shown promise in synthetic tabular data generation, yet existing methods struggle to preserve complex feature dependencies, particularly among categorical variables. This work introduces a probability-driven prompting approach that leverages LLMs to estimate conditional distributions, enabling more accurate and scalable data synthesis. The results highlight the potential of prompting probability distributions to enhance the statistical fidelity of LLM-generated tabular data.
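A minimal sketch of what prompting for a conditional distribution could look like, as opposed to prompting for a sampled row. It is an illustration under stated assumptions, not the paper's implementation: the prompt wording, the JSON reply format, the helper names (`build_prompt`, `parse_distribution`, `sample_value`), the example features, and the `fake_llm` stand-in for a real model call are all hypothetical.

```python
import json
import random


def build_prompt(target, categories, given):
    """Ask for P(target | given) as JSON rather than for a sampled value."""
    conditions = ", ".join(f"{k} = {v}" for k, v in given.items())
    return (
        f"Given {conditions}, estimate the probability of each value of "
        f"'{target}'. Answer with a JSON object mapping each of "
        f"{categories} to a probability."
    )


def parse_distribution(response_text, categories):
    """Parse the reply and renormalise so the probabilities sum to 1."""
    raw = json.loads(response_text)
    # Missing categories get zero mass; negative values are clipped to zero.
    weights = [max(float(raw.get(c, 0.0)), 0.0) for c in categories]
    total = sum(weights)
    if total <= 0:
        raise ValueError("no usable probability mass in response")
    return {c: w / total for c, w in zip(categories, weights)}


def sample_value(distribution, rng):
    """Draw one category according to the estimated probabilities."""
    values = list(distribution)
    weights = [distribution[v] for v in values]
    return rng.choices(values, weights=weights, k=1)[0]


def fake_llm(prompt):
    # Hypothetical stand-in for a real model call: returns a fixed reply
    # whose probabilities deliberately do not sum to 1.
    return '{"low": 0.2, "medium": 0.5, "high": 0.1}'


categories = ["low", "medium", "high"]
prompt = build_prompt("income", categories, {"education": "bachelor"})
dist = parse_distribution(fake_llm(prompt), categories)

rng = random.Random(0)  # seeded for reproducibility
samples = [sample_value(dist, rng) for _ in range(5)]
```

Sampling happens outside the model in this sketch, so the frequencies of the generated values follow the estimated distribution directly and do not depend on how the model decodes individual tokens. The renormalisation step guards against replies whose probabilities do not sum to one.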
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling
