Generative Models for Synthetic Data: Transforming Data Mining in the GenAI Era

Dawei Li; Yue Huang; Ming Li; Tianyi Zhou; Xiangliang Zhang; Huan Liu

arXiv:2508.19570 · cs.LG · August 28, 2025

TL;DR

This tutorial surveys how generative models such as LLMs, diffusion models, and GANs are changing synthetic data creation, and how they help with data scarcity, privacy, and annotation challenges in data mining. It also covers practical frameworks and evaluation methods.

Contribution

The tutorial provides an overview of recent advances, methodologies, and applications of generative models for synthetic data in data mining.

Findings

01. Generative models effectively address data scarcity and privacy concerns.

02. Synthetic data enhances data mining research and practice.

03. The paper discusses evaluation strategies for synthetic data quality.

Abstract

Generative models such as large language models, diffusion models, and generative adversarial networks have recently revolutionized the creation of synthetic data, offering scalable solutions to data scarcity, privacy, and annotation challenges in data mining. This tutorial introduces the foundations and latest advances in synthetic data generation, covers key methodologies and practical frameworks, and discusses evaluation strategies and applications. Attendees will gain actionable insights into leveraging generative synthetic data to enhance data mining research and practice. More information can be found on our website: https://syndata4dm.github.io/.
