Exploring the Potential of Synthetic Data to Replace Real Data
Hyungtae Lee, Yan Zhang, Heesung Kwon, Shuvra S. Bhattacharyya

TL;DR
This paper investigates how synthetic data can effectively replace real data in training AI models, especially when combined with limited real images from different domains, and introduces new metrics to evaluate this potential.
Contribution
It introduces two novel metrics, train2test distance and $\text{AP}_\text{t2t}$, to assess the effectiveness of synthetic data in cross-domain training scenarios.
Findings
Synthetic data's effectiveness varies with the number of cross-domain real images.
The test set influences the success of synthetic data in training.
New metrics provide insights into synthetic data's ability to represent test characteristics.
Abstract
The potential of synthetic data to replace real data creates a huge demand for synthetic data in data-hungry AI. This potential is even greater when synthetic data is used for training along with a small number of real images from domains other than the test domain. We find that this potential varies depending on (i) the number of cross-domain real images and (ii) the test set on which the trained model is evaluated. We introduce two new metrics, the train2test distance and $\text{AP}_\text{t2t}$, to evaluate the ability of a cross-domain training set using synthetic data to represent the characteristics of test instances in relation to training performance. Using these metrics, we delve deeper into the factors that influence the potential of synthetic data and uncover some interesting dynamics about how synthetic data impacts training performance. We hope these discoveries will…
Taxonomy
Topics: Advanced Database Systems and Queries · Time Series Analysis and Forecasting · Data Management and Algorithms
Methods: Sparse Evolutionary Training
