Exploring the Heterogeneity of Tabular Data: A Diversity-aware Data Generator via LLMs
Yafeng Tang, Xiaoou Ding, Jianzhuo Du, Zishuo Yan, Zhuang Ma, Zheng Liang, Zekai Qian, Hongzhi Wang

TL;DR
This paper introduces DATE, a diversity-aware data generator using LLMs that partitions heterogeneous data and balances diversity with quality, significantly improving tabular data generation for machine learning tasks.
Contribution
The paper proposes DATE, a framework that partitions heterogeneous data into diverse subsets, uses LLMs to explore the diversity within each subset, and applies a Multi-Arm Bandit algorithm to balance diversity and quality in the generated data, outperforming existing methods.
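The summary does not say how the bandit is formulated, so the following is only an illustrative sketch of the general idea, not the paper's algorithm. It uses standard UCB1 and assumes each arm is the generator for one partitioned subset and that the reward is a weighted mix of a quality score and a diversity score. The names `make_arm`, `run_bandit`, `ucb1_select`, and the `weight` and `noise` parameters are hypothetical.

```python
import math
import random


def ucb1_select(counts, totals, t, c=2.0):
    """Pick an arm by UCB1: try every arm once, then maximize mean + bonus."""
    for arm, n in enumerate(counts):
        if n == 0:
            return arm
    scores = [
        totals[a] / counts[a] + math.sqrt(c * math.log(t) / counts[a])
        for a in range(len(counts))
    ]
    return max(range(len(counts)), key=scores.__getitem__)


def make_arm(quality, diversity, weight=0.5, noise=0.05):
    """Hypothetical arm: reward mixes quality and diversity scores in [0, 1]."""
    def reward(rng):
        base = weight * quality + (1 - weight) * diversity
        # Small uniform noise stands in for sample-to-sample variation.
        return min(1.0, max(0.0, base + rng.uniform(-noise, noise)))
    return reward


def run_bandit(reward_fns, rounds, seed=0):
    """Run UCB1 for a fixed number of rounds and return pulls per arm."""
    rng = random.Random(seed)
    counts = [0] * len(reward_fns)
    totals = [0.0] * len(reward_fns)
    for t in range(1, rounds + 1):
        arm = ucb1_select(counts, totals, t)
        counts[arm] += 1
        totals[arm] += reward_fns[arm](rng)
    return counts


# Three assumed subset generators with different quality/diversity profiles.
arms = [make_arm(0.9, 0.2), make_arm(0.7, 0.8), make_arm(0.4, 0.5)]
counts = run_bandit(arms, rounds=500)
print(counts)
```

In this toy setup the second arm has the best combined score, so UCB1 samples it most often while still occasionally drawing from the others. Changing `weight` shifts the trade-off between quality and diversity.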
Findings
Achieves a 23.75% reduction in error rate on benchmarks.
Improves the accuracy of Direct Preference Optimization (DPO) and the reasoning ability of LLMs.
Outperforms state-of-the-art GAN and LLM-based data generators.
Abstract
Tabular data generation has become increasingly essential for enabling robust machine learning applications, which require large-scale, high-quality data. Existing solutions leverage generative models to learn original data distributions. However, real-world data are naturally heterogeneous with diverse distributions, making it challenging to obtain a universally good model for diverse data generation. To address this limitation, we introduce Diversity-Aware Tabular data gEnerator (DATE), a framework that (i) prepares high-quality and distributionally distinct examples for in-context learning by effectively partitioning the original heterogeneous data into multiple diverse subsets; (ii) harnesses Large Language Models (LLMs) to explore the diversity of the partitioned distribution with decision tree reasoning as feedback, generating high-quality labeled data for each subset. However,…
Taxonomy
Topics: Machine Learning and Data Classification · Explainable Artificial Intelligence (XAI) · Mobile Crowdsensing and Crowdsourcing
