Doubling Your Data in Minutes: Ultra-fast Tabular Data Generation via LLM-Induced Dependency Graphs

Shuo Yang; Zheyu Zhang; Bardh Prenkaj; Gjergji Kasneci

arXiv:2507.19334 · cs.LG · July 28, 2025

TL;DR

This paper introduces SPADA, a fast and lightweight method for generating high-quality synthetic tabular data by modeling sparse feature dependencies with an LLM-induced graph, significantly reducing computation time and bias.

Contribution

SPADA explicitly captures sparse feature dependencies using an LLM-induced graph, enabling ultra-fast and less biased tabular data augmentation compared to existing dense dependency models.

Findings

1. Reduces constraint violations by 4% compared to diffusion methods.

2. Accelerates data generation by nearly 9,500 times over LLM-based baselines.

3. Maintains high data quality with lower bias.

Abstract

Tabular data is critical across diverse domains, yet high-quality datasets remain scarce due to privacy concerns and the cost of collection. Contemporary approaches adopt large language models (LLMs) for tabular augmentation, but they exhibit two major limitations: (1) dense dependency modeling among tabular features, which can introduce bias, and (2) high computational overhead in sampling. To address these issues, we propose SPADA (SPArse Dependency-driven Augmentation), a lightweight generative framework that explicitly captures sparse dependencies via an LLM-induced graph. We treat each feature as a node and synthesize values by traversing the graph, conditioning each feature solely on its parent nodes. We explore two synthesis strategies: a non-parametric method using Gaussian kernel density estimation, and a conditional normalizing flow model that learns invertible mappings for…
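The graph-traversal idea in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the dependency graph below is written by hand rather than induced by an LLM, the toy columns `age` and `income` are made up, and the function names `scott_bandwidth` and `sample_rows` are hypothetical. It also assumes numeric features only, a product Gaussian kernel, and Scott's rule for bandwidths. Features are visited in topological order, and each one is sampled from a kernel density estimate conditioned only on its already-sampled parents.

```python
from graphlib import TopologicalSorter

import numpy as np


def scott_bandwidth(values, dim):
    # Scott's rule per dimension (an assumption; the paper may choose differently).
    n = len(values)
    return values.std(ddof=1) * n ** (-1.0 / (dim + 4))


def sample_rows(data, parents, n_samples, seed=0):
    """Sample synthetic rows by walking a feature DAG in topological order.

    data:    dict mapping feature name -> 1-D float array of real values
    parents: dict mapping feature name -> list of parent feature names
    """
    rng = np.random.default_rng(seed)
    # TopologicalSorter takes node -> predecessors, so parents come first.
    order = list(TopologicalSorter(parents).static_order())
    n = len(next(iter(data.values())))
    out = {}
    for feat in order:
        pa = parents.get(feat, [])
        dim = len(pa) + 1
        h_child = scott_bandwidth(data[feat], dim)
        if not pa:
            # Root feature: plain 1-D Gaussian KDE, so pick a real row uniformly.
            idx = rng.integers(0, n, size=n_samples)
        else:
            # Child feature: weight each real row by how close its parent
            # values are to the parent values already sampled for this row.
            logw = np.zeros((n_samples, n))
            for p in pa:
                h_p = scott_bandwidth(data[p], dim)
                diff = (out[p][:, None] - data[p][None, :]) / h_p
                logw -= 0.5 * diff**2
            logw -= logw.max(axis=1, keepdims=True)  # numerical stability
            w = np.exp(logw)
            w /= w.sum(axis=1, keepdims=True)
            idx = np.array([rng.choice(n, p=row) for row in w])
        # Add Gaussian kernel noise around the chosen real value.
        out[feat] = data[feat][idx] + h_child * rng.standard_normal(n_samples)
    return out


# Toy data: income depends on age (both columns are illustrative only).
rng = np.random.default_rng(42)
age = rng.normal(40.0, 10.0, size=500)
income = 1000.0 * age + rng.normal(0.0, 5000.0, size=500)
data = {"age": age, "income": income}

# Hand-written sparse dependency graph standing in for an LLM-induced one.
parents = {"age": [], "income": ["age"]}

synth = sample_rows(data, parents, n_samples=500)
```

Because each feature looks only at its parents, the cost per feature scales with the number of parents rather than with the full column count, which is the intuition behind the sparse design. The per-row `rng.choice` loop is kept for readability and would be the first thing to vectorize in practice.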

Taxonomy

Topics: Machine Learning in Healthcare · Generative Adversarial Networks and Image Synthesis · Topic Modeling