EPIC: Effective Prompting for Imbalanced-Class Data Synthesis in Tabular Data Classification via Large Language Models
Jinhee Kim, Taesung Kim, Jaegul Choo

TL;DR
EPIC is a novel method that uses large language models with optimized prompts to generate high-quality synthetic tabular data, effectively addressing class imbalance and improving classification performance.
Contribution
This work introduces EPIC, a new prompt design strategy that enhances LLM-based synthetic data generation for imbalanced tabular datasets, achieving state-of-the-art results.
Findings
Classifiers trained on EPIC-generated data achieve state-of-the-art classification performance on real-world datasets.
EPIC significantly improves data generation efficiency.
EPIC effectively handles class imbalance in real-world datasets.
Abstract
Large language models (LLMs) have demonstrated remarkable in-context learning capabilities across diverse applications. In this work, we explore the effectiveness of LLMs for generating realistic synthetic tabular data, identifying key prompt design elements to optimize performance. We introduce EPIC, a novel approach that leverages balanced, grouped data samples and consistent formatting with unique variable mapping to guide LLMs in generating accurate synthetic data across all classes, even for imbalanced datasets. Evaluations on real-world datasets show that EPIC achieves state-of-the-art machine learning classification performance while significantly improving generation efficiency. These findings highlight the effectiveness of EPIC for synthetic tabular data generation, particularly in addressing class imbalance. The source code for our work is available at:…
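The prompt design elements named in the abstract (balanced class sampling, grouping by class, consistent formatting, and a unique mapping for variable values) can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `build_balanced_prompt`, the CSV-style layout, and the `V0`, `V1`, ... token scheme are assumptions made for illustration, and "unique variable mapping" is interpreted here as giving each distinct categorical value its own token.

```python
import random
from collections import defaultdict


def build_balanced_prompt(rows, label_key, n_per_class, seed=0):
    """Build a CSV-style prompt with equally many examples per class.

    Hypothetical helper, not taken from the paper's code.
    `rows` is a non-empty list of dicts that share the same keys.
    Returns the prompt text and the value-to-token mapping.
    """
    rng = random.Random(seed)
    columns = list(rows[0].keys())

    # Group rows by class label so each class appears as one contiguous block.
    by_class = defaultdict(list)
    for row in rows:
        by_class[row[label_key]].append(row)

    # Assumed mapping scheme: every distinct (column, string value) pair
    # receives its own token, so identical strings in different columns
    # cannot be confused with each other.
    mapping = {}

    def encode(column, value):
        if isinstance(value, str):
            key = (column, value)
            if key not in mapping:
                mapping[key] = f"V{len(mapping)}"
            return mapping[key]
        return str(value)

    # One header line, then one line per sampled row, all in the same format.
    lines = [",".join(columns)]
    for label in sorted(by_class):
        pool = by_class[label]
        if len(pool) >= n_per_class:
            sample = rng.sample(pool, n_per_class)
        else:
            # Minority class is too small: draw with replacement.
            sample = [rng.choice(pool) for _ in range(n_per_class)]
        for row in sample:
            lines.append(",".join(encode(c, row[c]) for c in columns))

    return "\n".join(lines), mapping
```

The returned text would serve as the in-context examples in a generation request, and the returned mapping would be inverted to decode the values the LLM produces. Sampling the same number of rows from every class means the minority class is as visible in the prompt as the majority class, regardless of the imbalance in the original data.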
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Speech Recognition and Synthesis
