Augmented Conditioning Is Enough For Effective Training Image Generation
Jiahui Chen, Amy Zhang, Adriana Romero-Soriano

TL;DR
This paper demonstrates that augmenting real images and text prompts for conditioning in diffusion models enhances the diversity and realism of generated images, improving their utility for training downstream image classifiers without model fine-tuning.
Contribution
The study introduces a simple augmentation-conditioning method that significantly boosts synthetic data quality for training classifiers, outperforming existing approaches across multiple benchmarks.
Findings
Improved classifier performance on long-tail and few-shot benchmarks.
Augmentation-conditioning yields consistent gains over state-of-the-art methods.
Effective synthetic data generation without fine-tuning the diffusion model.
Abstract
Image generation abilities of text-to-image diffusion models have significantly advanced, yielding highly photo-realistic images from descriptive text and increasing the viability of leveraging synthetic images to train computer vision models. To serve as effective training data, generated images must be highly realistic while also sufficiently diverse within the support of the target data distribution. Yet, state-of-the-art conditional image generation models have been primarily optimized for creative applications, prioritizing image realism and prompt adherence over conditional diversity. In this paper, we investigate how to improve the diversity of generated images with the goal of increasing their effectiveness to train downstream image classification models, without fine-tuning the image generation model. We find that conditioning the generation process on an augmented real image…
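As a rough illustration of the conditioning idea described in the abstract, the sketch below blends two real images with Mixup (one of the augmentations the reviews mention) and pairs the result with a text prompt. It is not the authors' implementation: the function names, the prompt template "a photo of a {class_name}", the alpha value, and the use of nested Python lists as images are all assumptions made for this example, and the call to an actual diffusion model is deliberately left out.

```python
import random


def mixup(image_a, image_b, alpha=0.4, rng=None):
    """Blend two equally sized images pixel-wise (Mixup).

    Images are nested lists of floats here purely to keep the sketch
    dependency-free; alpha=0.4 is an illustrative choice, not a value
    taken from the paper.
    """
    if len(image_a) != len(image_b) or any(
        len(ra) != len(rb) for ra, rb in zip(image_a, image_b)
    ):
        raise ValueError("images must have the same shape")
    rng = rng or random.Random()
    # Mixing coefficient drawn from Beta(alpha, alpha), as in standard Mixup.
    lam = rng.betavariate(alpha, alpha)
    mixed = [
        [lam * pa + (1.0 - lam) * pb for pa, pb in zip(row_a, row_b)]
        for row_a, row_b in zip(image_a, image_b)
    ]
    return mixed, lam


def build_conditioning(image_a, image_b, class_name, alpha=0.4, rng=None):
    """Pair an augmented real image with a text prompt.

    The prompt template is hypothetical; the returned dict is what one
    might pass to an image-and-text conditioned generator.
    """
    mixed, lam = mixup(image_a, image_b, alpha=alpha, rng=rng)
    return {"image": mixed, "prompt": f"a photo of a {class_name}", "lam": lam}
```

In this sketch the frozen generator would receive both `image` and `prompt`, so no fine-tuning step appears anywhere in the pipeline.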
Peer Reviews
Decision: ICLR 2025 Conference Withdrawn Submission
1. The paper introduces augmentation-conditioning, an approach that leverages real images with data augmentations to create synthetic images that are both realistic and diverse. This method bridges the domain gap between synthetic and real data, enhancing downstream classification performance without requiring extensive fine-tuning of the diffusion model.
2. The method’s effectiveness is demonstrated across multiple challenging benchmarks, including long-tail and few-shot classification task
1. The technical novelty of the proposed method seems limited, as it mainly combines existing data augmentations, like Mixup, before inputting images into an existing diffusion model. More discussion of the method’s novelty is necessary. Besides, to better demonstrate the effectiveness of the proposed method, it would be beneficial to consider more recent tuning-free approaches for diffusion models, such as [1]. Additional discussion and experiments comparing the superiority of the proposed method w
- The approach appears valid and effective for LT and few-shot classification, as demonstrated in the experiments. The authors also tested various augmentations.
- Conditioning on both the augmented image and text prompt seems effective for improving performance on classification tasks.
- Experiments on different values of classifier-free guidance (CFG) are interesting, especially regarding how the optimal scale varies by task.
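For readers unfamiliar with the CFG scale mentioned in the last point, the following minimal sketch shows the standard classifier-free guidance combination of an unconditional and a conditional noise prediction. It is a generic illustration, not code from the paper: the function name and the use of flat Python lists for the predictions are assumptions made for this example.

```python
def cfg_combine(eps_uncond, eps_cond, scale):
    """Classifier-free guidance: move from the unconditional prediction
    toward the conditional one by a factor of `scale`.

    scale = 0 returns the unconditional prediction, scale = 1 the
    conditional one, and scale > 1 extrapolates past it.
    """
    return [u + scale * (c - u) for u, c in zip(eps_uncond, eps_cond)]
```

Larger scales typically trade sample diversity for closer adherence to the conditioning, which is one way to read the reviewer's remark that the best scale depends on the task.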
- The technical novelty of this paper is unclear. The concept of combining both augmented images and text prompts seems useful for LT and few-shot classification but lacks novelty. If this approach is not technically original, the paper should at least show a broad variety of downstream tasks that benefit from it, which it did not.
- The contribution is not clearly articulated. Although it’s evident that the synthetic dataset is effective, it’s unclear for which specific tasks it is most useful.
- The problems and the proposed method are clearly presented and easy to understand.
- The proposed method is effective and straightforward in practice, with extensive ablation experiments conducted to support its design.
- The synthesized training data demonstrates strong performance in few-shot classification.
In Table 3, the Fill-Up method demonstrates higher accuracy than the proposed method. Although the positive correlation between accuracy and training dataset scale is discussed in line 376, it remains unclear whether the proposed method can outperform Fill-Up. Given the computing constraints, I suggest the authors:
- Illustrate the positive relationship between accuracy and dataset scale using data synthesized by the proposed method.
- Provide results for Fill-Up with different, smaller synthet
A) Writing is overall quite clear.
B) Experiments are varied across areas (long-tail, few-shot).
C) Method section shows many qualitative examples to supplement the quantitative results in the experimental section.
D) Some of the results show improvement.
Points are ordered roughly according to my perceived scale, with more important points being listed first.
A) The method itself is quite simplistic from a novelty perspective (simply adding augmentations to the conditioning). I would consider this a strength, if the results were consistent (B) and strong (B, D, E, F) with a clear storyline for effective use-cases (C). However, I do not see this as being the case (see following points for details, as indicated in the corresponding parentheses).
Taxonomy
Topics: Surgical Simulation and Training
Methods: Diffusion
