A Survey of Data Synthesis Approaches
Hsin-Yu Chang, Pei-Yu Chen, Tun-Hsiang Chou, Chang-Sheng Kao, Hsuan-Yun Yu, Yen-Ting Lin, Yun-Nung Chen

TL;DR
This survey comprehensively reviews synthetic data techniques, their goals, categories, filtering strategies, and future directions, providing a structured overview of the field for researchers and practitioners.
Contribution
It classifies synthetic data approaches and filtering methods in detail, and highlights key future research directions in synthetic data generation.
Findings
Synthetic data aims to improve diversity, balance data, address domain shifts, and resolve edge cases.
Synthetic data techniques are categorized into four groups: expert-knowledge, direct training, pre-train then fine-tune, and foundation models without fine-tuning.
Future directions include focusing on quality, evaluation methods, and multi-model data augmentation.
Abstract
This paper provides a detailed survey of synthetic data techniques. We first discuss the expected goals of using synthetic data in data augmentation, which can be divided into four parts: 1) Improving Diversity, 2) Data Balancing, 3) Addressing Domain Shift, and 4) Resolving Edge Cases. Data synthesis is closely tied to the machine learning techniques prevailing at the time; we therefore summarize synthetic data techniques into four categories: 1) Expert-knowledge, 2) Direct Training, 3) Pre-train then Fine-tune, and 4) Foundation Models without Fine-tuning. Next, we categorize the goals of synthetic data filtering into three types for discussion: 1) Basic Quality, 2) Label Consistency, and 3) Data Distribution. In Section 5 of this paper, we also discuss the future directions of synthetic data and state three directions that we believe are important: 1) focus more on…
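To make the three filtering goals named in the abstract concrete, here is a minimal Python sketch that applies one crude check per goal to a list of synthetic (text, label) pairs. It is an illustration under stated assumptions, not the survey's method: the function names, the thresholds, the duplicate and per-label cap used for the distribution check, and the toy keyword classifier are all hypothetical.

```python
from collections import Counter


def basic_quality_ok(text, min_words=3, max_token_share=0.5):
    """Basic quality: reject very short or heavily repetitive text.

    Both thresholds are illustrative assumptions, not values from the paper.
    """
    words = text.split()
    if len(words) < min_words:
        return False
    # Share of the single most frequent token (a crude repetition check).
    top_count = Counter(words).most_common(1)[0][1]
    return top_count / len(words) <= max_token_share


def label_consistent(text, label, classifier):
    """Label consistency: keep a sample only if a reference classifier agrees."""
    return classifier(text) == label


def filter_synthetic(samples, classifier, max_per_label):
    """Apply the three checks in order to (text, label) pairs."""
    kept, per_label, seen = [], Counter(), set()
    for text, label in samples:
        if not basic_quality_ok(text):
            continue
        if not label_consistent(text, label, classifier):
            continue
        # Data distribution: drop exact duplicates and cap each label's count.
        key = text.lower().strip()
        if key in seen or per_label[label] >= max_per_label:
            continue
        seen.add(key)
        per_label[label] += 1
        kept.append((text, label))
    return kept


def toy_classifier(text):
    # Stand-in for a trained model; purely illustrative.
    return "pos" if "good" in text.lower() else "neg"


samples = [
    ("this movie was really good", "pos"),
    ("this movie was really good", "pos"),   # exact duplicate
    ("good", "pos"),                          # too short
    ("the plot was dull and slow", "pos"),    # classifier disagrees with label
    ("the plot was dull and slow", "neg"),
    ("a good cast and a good script", "pos"),
    ("good good good good", "pos"),           # one token dominates
]
kept = filter_synthetic(samples, toy_classifier, max_per_label=2)
for text, label in kept:
    print(label, "|", text)
```

In practice each check would likely be replaced by something stronger (for example, a trained classifier for label consistency or an embedding-based comparison for distribution), but the ordering of cheap checks before expensive ones is a common design choice.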
