Does Feasibility Matter? Understanding the Impact of Feasibility on Synthetic Training Data

Yiwen Liu; Jessica Bader; Jae Myung Kim

arXiv:2505.10551 · cs.CV · May 16, 2025

TL;DR

This study investigates whether enforcing the realism of synthetic images, termed feasibility, significantly impacts the performance of CLIP-based classifiers trained on such data, finding minimal effects in most cases.

Contribution

The paper introduces VariReal, a pipeline for minimally editing images to control feasibility, and provides empirical evidence that feasibility has limited impact on classifier performance.

Findings

01. Feasibility has minimal effect on CLIP classifier performance: the accuracy difference is below 0.3% in most cases.

02. The impact of feasibility depends on the attribute being edited, and on whether images with that attribute adversarially influence classification.

03. Mixing feasible and infeasible images in the training set does not significantly change results.
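
The comparison behind these findings can be illustrated with a small sketch. It is not the authors' code: the sample format, the function names (`mix_training_set`, `top1_accuracy`, `accuracy_gap`), and the mixing fraction are all hypothetical, chosen only to show how a mixed feasible/infeasible training set could be assembled and how an accuracy gap between two training runs would be measured.

```python
import random
from typing import List, Sequence, Tuple

# Hypothetical sample format: (image path, class label).
Sample = Tuple[str, int]


def mix_training_set(
    feasible: Sequence[Sample],
    infeasible: Sequence[Sample],
    feasible_fraction: float,
    size: int,
    seed: int = 0,
) -> List[Sample]:
    """Draw `size` samples, `feasible_fraction` of them from the feasible pool."""
    if not 0.0 <= feasible_fraction <= 1.0:
        raise ValueError("feasible_fraction must be in [0, 1]")
    n_feasible = round(size * feasible_fraction)
    n_infeasible = size - n_feasible
    if n_feasible > len(feasible) or n_infeasible > len(infeasible):
        raise ValueError("not enough samples in one of the pools")
    rng = random.Random(seed)  # seeded so the mix is reproducible
    mixed = rng.sample(list(feasible), n_feasible)
    mixed += rng.sample(list(infeasible), n_infeasible)
    rng.shuffle(mixed)
    return mixed


def top1_accuracy(predictions: Sequence[int], labels: Sequence[int]) -> float:
    """Top-1 accuracy in percent."""
    if len(predictions) != len(labels) or len(labels) == 0:
        raise ValueError("predictions and labels must be non-empty and aligned")
    correct = sum(p == y for p, y in zip(predictions, labels))
    return 100.0 * correct / len(labels)


def accuracy_gap(acc_a: float, acc_b: float) -> float:
    """Absolute difference, in percentage points, between two accuracies."""
    return abs(acc_a - acc_b)
```

In this sketch, a classifier would be trained once per training set (feasible only, infeasible only, or mixed) and evaluated on the same real test set; `accuracy_gap` then gives the quantity that the first finding reports as staying below 0.3% in most cases.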

Abstract

With the development of photorealistic diffusion models, models trained in part or fully on synthetic data achieve progressively better results. However, diffusion models still routinely generate images that would not exist in reality, such as a dog floating above the ground or with unrealistic texture artifacts. We define the concept of feasibility as whether attributes in a synthetic image could realistically exist in the real-world domain; synthetic images containing attributes that violate this criterion are considered infeasible. Intuitively, infeasible images are typically considered out-of-distribution; thus, training on such images is expected to hinder a model's ability to generalize to real-world data, and they should therefore be excluded from the training set whenever possible. However, does feasibility really matter? In this paper, we investigate whether enforcing…

Code & Models

Repositories

yiveen/syntheticdatafeasibility
PyTorch · Official

Taxonomy

Topics: Human Resource Development and Performance Evaluation · Intelligent Tutoring Systems and Adaptive Learning

Methods: Diffusion · Sparse Evolutionary Training · Contrastive Language-Image Pre-training