EPIC: Effective Prompting for Imbalanced-Class Data Synthesis in Tabular Data Classification via Large Language Models

Jinhee Kim; Taesung Kim; Jaegul Choo

arXiv:2404.12404 · cs.LG · January 15, 2025 · 1 citation

TL;DR

EPIC is a novel method that uses large language models with optimized prompts to generate high-quality synthetic tabular data, effectively addressing class imbalance and improving classification performance.

Contribution

This work introduces EPIC, a prompt design strategy that improves LLM-based synthetic data generation for imbalanced tabular datasets and achieves state-of-the-art classification performance on real-world datasets.

Findings

01. EPIC outperforms existing methods in classification accuracy.
02. EPIC significantly improves data generation efficiency.
03. EPIC effectively handles class imbalance in real-world datasets.

Abstract

Large language models (LLMs) have demonstrated remarkable in-context learning capabilities across diverse applications. In this work, we explore the effectiveness of LLMs for generating realistic synthetic tabular data, identifying key prompt design elements to optimize performance. We introduce EPIC, a novel approach that leverages balanced, grouped data samples and consistent formatting with unique variable mapping to guide LLMs in generating accurate synthetic data across all classes, even for imbalanced datasets. Evaluations on real-world datasets show that EPIC achieves state-of-the-art machine learning classification performance and significantly improves generation efficiency. These findings highlight the effectiveness of EPIC for synthetic tabular data generation, particularly in addressing class imbalance. The source code for our work is available at:…
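The prompt design elements named in the abstract (balanced, grouped samples; consistent formatting; unique variable mapping) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the CSV-style layout, the three-letter tokens, and the toy dataset are all assumptions made for illustration, and the official repository may differ on every one of these points.

```python
import random
import string
from collections import defaultdict


def build_value_maps(rows, categorical_cols, seed=0):
    """Assumed 'unique variable mapping': give every categorical value a
    token that is unique across all columns, so no two values collide."""
    rng = random.Random(seed)
    used, maps = set(), {}
    for col in categorical_cols:
        maps[col] = {}
        for value in sorted({row[col] for row in rows}):
            token = "".join(rng.choices(string.ascii_uppercase, k=3))
            while token in used:  # redraw on collision
                token = "".join(rng.choices(string.ascii_uppercase, k=3))
            used.add(token)
            maps[col][value] = token
    return maps


def format_row(row, columns, maps):
    """Render one row in a fixed column order, replacing mapped values."""
    return ",".join(str(maps.get(c, {}).get(row[c], row[c])) for c in columns)


def build_prompt(rows, columns, label_col, maps, per_class=2, n_groups=2, seed=0):
    """Build a prompt of n_groups blocks; each block holds the same number
    of examples per class, grouped by class, in one consistent format."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for row in rows:
        by_class[row[label_col]].append(row)

    header = ",".join(columns)
    blocks = []
    for _ in range(n_groups):
        lines = [header]
        for label in sorted(by_class):  # same class order in every block
            pool = by_class[label]
            if len(pool) >= per_class:
                picked = rng.sample(pool, per_class)
            else:  # minority class too small: sample with replacement
                picked = rng.choices(pool, k=per_class)
            lines.extend(format_row(r, columns, maps) for r in picked)
        blocks.append("\n".join(lines))
    blocks.append(header)  # trailing header invites the model to continue
    return "\n\n".join(blocks)


# Hypothetical toy data: 8 majority-class rows, 1 minority-class row.
rows = [{"age": 30 + i, "job": "clerk", "income": "low"} for i in range(8)]
rows.append({"age": 50, "job": "manager", "income": "high"})
columns = ["age", "job", "income"]

maps = build_value_maps(rows, ["job", "income"])
prompt = build_prompt(rows, columns, "income", maps)
print(prompt)
```

In this sketch every block shows the minority class as often as the majority class, regardless of the 8:1 imbalance in the data, and generated rows would be decoded by inverting `maps`. Whether the label column itself should be mapped is a design choice this sketch makes arbitrarily.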

Code & Models

Repositories

seharanul17/synthetic-tabular-LLM (official)

Videos

EPIC: Effective Prompting for Imbalanced-Class Data Synthesis in Tabular Data Classification via Large Language Models · SlidesLive

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Speech Recognition and Synthesis