The Unmet Promise of Synthetic Training Images: Using Retrieved Real Images Performs Better

Scott Geng; Cheng-Yu Hsieh; Vivek Ramanujan; Matthew Wallingford; Chun-Liang Li; Pang Wei Koh; Ranjay Krishna

arXiv:2406.05184·cs.CV·January 3, 2025

TL;DR

This paper shows that, for finetuning image classifiers, real images retrieved directly from LAION-2B match or outperform synthetic images generated by Stable Diffusion, a model trained on that same dataset.

Contribution

The study compares finetuning on targeted synthetic images against finetuning on targeted real images retrieved from the generator's own training data. Retrieval matches or beats synthetic data across the tasks studied, which challenges the reliance on generative models as a source of training images.

Findings

01 Retrieved real images match or outperform synthetic images for finetuning vision models.

02 Synthetic images contain generator artifacts and inaccurate visual details that hurt downstream performance.

03 Targeted retrieval is a strong baseline that current synthetic-data methods do not surpass.

Abstract

Generative text-to-image models enable us to synthesize unlimited amounts of images in a controllable manner, spurring many recent efforts to train vision models with synthetic data. However, every synthetic image ultimately originates from the upstream data used to train the generator. Does the intermediate generator provide additional information over directly training on relevant parts of the upstream data? Grounding this question in the setting of image classification, we compare finetuning on task-relevant, targeted synthetic data generated by Stable Diffusion -- a generative model trained on the LAION-2B dataset -- against finetuning on targeted real images retrieved directly from LAION-2B. We show that while synthetic data can benefit some downstream tasks, it is universally matched or outperformed by real data from the simple retrieval baseline. Our analysis suggests that this…

Code & Models

Repositories

scottgeng00/unmet-promise (PyTorch, official)

Datasets

scottgeng00/unmet-promise
dataset · 498 downloads

Videos

The Unmet Promise of Synthetic Training Images: Using Retrieved Real Images Performs Better (SlidesLive)

Taxonomy

Topics: Surgical Simulation and Training

Methods: Diffusion