Do We Need All the Synthetic Data? Targeted Image Augmentation via Diffusion Models

Dang Nguyen; Jiping Li; Jinghao Zheng; Baharan Mirzasoleiman

arXiv:2505.21574·cs.CV·March 5, 2026

Open Access 3 Reviews

TL;DR

TADA is a targeted data augmentation framework using diffusion models that selectively enhances under-learned examples, leading to improved generalization across various architectures and datasets with reduced computational cost.

Contribution

The paper introduces TADA, a novel targeted augmentation method that selectively augments a subset of data, outperforming full dataset augmentation and reducing computational overhead.

Findings

01

Augmenting only 30-40% of data improves accuracy by up to 2.8%.

02

TADA outperforms full augmentation and state-of-the-art optimizers.

03

Effective on multiple architectures and tasks, including object detection.

Abstract

Synthetically augmenting training datasets with diffusion models has become an effective strategy for improving the generalization of image classifiers. However, existing approaches typically increase dataset size by 10-30x and struggle to ensure generation diversity, leading to substantial computational overhead. In this work, we introduce TADA (TArgeted Diffusion Augmentation), a principled framework that selectively augments examples that are not learned early in training using faithful synthetic images that preserve semantic features while varying noise. We show that augmenting only this targeted subset consistently outperforms augmenting the entire dataset. Through theoretical analysis on a two-layer CNN, we prove that TADA improves generalization by promoting homogeneity in feature learning speed without amplifying noise. Extensive experiments demonstrate that by augmenting only…
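The selection step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract excerpt does not give the exact criterion for "not learned early in training", so the rule below (an example counts as under-learned if it is never classified correctly during the tracked early epochs) is an assumption. The names `select_under_learned`, `augmented_size`, `early_correct`, and `copies` are hypothetical, and the diffusion-based generation step is left out entirely.

```python
from typing import List, Sequence


def select_under_learned(early_correct: Sequence[Sequence[bool]]) -> List[int]:
    """Return indices of examples never classified correctly in the early epochs.

    early_correct[e][i] is True if example i was predicted correctly at early
    epoch e. This selection rule is an assumption made for illustration.
    """
    if not early_correct:
        return []
    num_examples = len(early_correct[0])
    return [
        i for i in range(num_examples)
        if not any(epoch[i] for epoch in early_correct)
    ]


def augmented_size(num_examples: int, targeted: Sequence[int], copies: int) -> int:
    """Dataset size after adding `copies` synthetic images per targeted example."""
    return num_examples + len(targeted) * copies


if __name__ == "__main__":
    # Two early epochs, five training examples (toy values, not paper data).
    early_correct = [
        [True, False, False, True, False],
        [True, True, False, True, False],
    ]
    targeted = select_under_learned(early_correct)
    print(targeted)  # [2, 4]
    print(augmented_size(5, targeted, copies=5))  # 15
```

In this toy run only examples 2 and 4 would be sent to the generator, so the training set grows from 5 to 15 images, whereas augmenting every example with the same number of copies would give 30.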

Peer Reviews

Decision·ICLR 2026 Poster

Reviewer 01 · Rating 6 · Confidence 5

Strengths

The main strength is that current data augmentation papers focus only on generating data with high fidelity and diversity to obtain a more robust decision boundary, while very few papers consider how to balance the real and synthetic sets during training. This paper fills that gap in generative data augmentation research.

Weaknesses

The method is general, but the evaluation is limited.
1/ The evaluated backbones are weak, and it is unclear whether the benefit of the method persists with better-pretrained backbones.
2/ Since the method is a plug-and-play module, why not evaluate it on top of more state-of-the-art methods such as [1,2,3,4]? These should at least be discussed in the related work.
3/ Evaluations on fine-grained datasets are missing.
4/ This method seems applicable not only to image classification datase

Reviewer 02 · Rating 4 · Confidence 3

Strengths

- The central idea of targeting slow-learning samples for augmentation is novel and intuitive. Focusing augmentation efforts on more challenging examples seems a logical approach to improving model robustness and generalization.
- The paper provides extensive empirical validation across three different datasets, lending credibility to the proposed method's effectiveness. The observation regarding the characteristics of slow-learned samples is particularly interesting and furth

Weaknesses

- The theoretical analysis relies on a simplified two-layer CNN assumption. This raises questions about the direct applicability and relevance of the derived theorems to the deeper, more complex architectures commonly used in practice. The paper would be strengthened by a discussion bridging this theoretical gap.
- I have concerns regarding the significant computational overhead of the proposed method. Utilizing a diffusion model for data generation, even for a subset of the data, is inherently

Reviewer 03 · Rating 4 · Confidence 3

Strengths

**Efficiency**: Augmenting only 30–40% of data outperforms full-dataset augmentation, offering a practical, resource-aware solution.
**Empirical Support**: Ablation studies on augmentation factors and initialization provide useful insights.
**Compatibility**: Works well with existing methods (e.g., SAM), boosting performance further.

Weaknesses

**Missing Prior**: The method overlaps with "Boomerang" [1], which uses similar noise-add-and-denoise techniques for data augmentation in classification, but it is neither cited nor compared against. Notably, Boomerang uses the entire dataset for synthetic data generation and still reports accuracy gains, which contradicts the experiments in this paper.
**Theory-Practice Gap**: The claim of mimicking SAM’s feature learning (e.g., sections 4.1–4.2 suggest SAM-like noise suppression and uniform learning) doesn’t fully a

Taxonomy

Topics: Generative Adversarial Networks and Image Synthesis · Domain Adaptation and Few-Shot Learning · Face recognition and analysis