The Unmet Promise of Synthetic Training Images: Using Retrieved Real Images Performs Better
Scott Geng, Cheng-Yu Hsieh, Vivek Ramanujan, Matthew Wallingford, Chun-Liang Li, Pang Wei Koh, Ranjay Krishna

TL;DR
This paper shows that finetuning vision classifiers on real images retrieved directly from LAION-2B, the dataset used to train Stable Diffusion, matches or outperforms finetuning on synthetic images generated by Stable Diffusion, highlighting the limits of current synthetic training data.
Contribution
The study compares finetuning on targeted synthetic images against finetuning on targeted real images retrieved from the generator's own training data. Synthetic data never beats the retrieval baseline, which challenges the case for using generative models as a source of training data.
Findings
Training on retrieved real images matches or outperforms training on synthetic images for vision classifiers.
Synthetic images suffer from artifacts and inaccurate details, which hurt downstream performance.
Targeted retrieval is a strong baseline that current synthetic methods do not surpass.
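The summary does not say how the targeted retrieval is performed. As a rough illustration only, the sketch below assumes a common approach: embed a task query and every candidate image in a shared vector space, then keep the top-k candidates by cosine similarity. The function name `retrieve_top_k` and the toy 2-D embeddings are hypothetical and stand in for real image and text embeddings; this is not the authors' pipeline.

```python
import numpy as np

def retrieve_top_k(query_emb, pool_embs, k):
    """Return (indices, scores) of the k pool rows most cosine-similar to the query.

    Hypothetical helper for illustration; not taken from the paper.
    """
    # Normalize so that a dot product equals cosine similarity.
    q = query_emb / np.linalg.norm(query_emb)
    p = pool_embs / np.linalg.norm(pool_embs, axis=1, keepdims=True)
    sims = p @ q
    # Sort by descending similarity and keep the first k indices.
    top = np.argsort(-sims)[:k]
    return top, sims[top]

# Toy pool of four "image" embeddings (assumed values, for demonstration only).
pool = np.array([
    [1.0, 0.0],   # same direction as the query
    [0.9, 0.1],   # close to the query
    [0.0, 1.0],   # orthogonal to the query
    [-1.0, 0.0],  # opposite to the query
])
query = np.array([1.0, 0.0])  # stand-in for a class-name embedding

idx, scores = retrieve_top_k(query, pool, k=2)
print(idx.tolist())  # indices of the two most similar pool items
```

In a real setting the retrieved subset would then be used to finetune the classifier in place of generated images; the selection rule and the value of k are design choices this summary does not specify.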
Abstract
Generative text-to-image models enable us to synthesize unlimited amounts of images in a controllable manner, spurring many recent efforts to train vision models with synthetic data. However, every synthetic image ultimately originates from the upstream data used to train the generator. Does the intermediate generator provide additional information over directly training on relevant parts of the upstream data? Grounding this question in the setting of image classification, we compare finetuning on task-relevant, targeted synthetic data generated by Stable Diffusion -- a generative model trained on the LAION-2B dataset -- against finetuning on targeted real images retrieved directly from LAION-2B. We show that while synthetic data can benefit some downstream tasks, it is universally matched or outperformed by real data from the simple retrieval baseline. Our analysis suggests that this…
Taxonomy
Topics: Surgical Simulation and Training
Methods: Diffusion
