Generative Models for Synthetic Data: Transforming Data Mining in the GenAI Era
Dawei Li, Yue Huang, Ming Li, Tianyi Zhou, Xiangliang Zhang, Huan Liu

TL;DR
This tutorial reviews how generative models such as LLMs, diffusion models, and GANs are transforming synthetic data creation, and how synthetic data helps address data scarcity, privacy, and annotation challenges in data mining. It also covers practical frameworks and evaluation methods.
Contribution
It provides a comprehensive overview of recent advances, methodologies, and applications of generative models for synthetic data in data mining.
Findings
Generative models effectively address data scarcity and privacy concerns.
Synthetic data enhances data mining research and practice.
The paper discusses evaluation strategies for synthetic data quality.
Abstract
Generative models such as large language models, diffusion models, and generative adversarial networks have recently revolutionized the creation of synthetic data, offering scalable solutions to data scarcity, privacy, and annotation challenges in data mining. This tutorial introduces the foundations and latest advances in synthetic data generation, covers key methodologies and practical frameworks, and discusses evaluation strategies and applications. Attendees will gain actionable insights into leveraging generative synthetic data to enhance data mining research and practice. More information can be found on our website: https://syndata4dm.github.io/.
