Score Augmentation for Diffusion Models
Liang Hou, Yuan Gao, Boyuan Jiang, Xin Tao, Qi Yan, Renjie Liao, Pengfei Wan, Di Zhang, Kun Gai

TL;DR
This paper introduces Score Augmentation, a novel data augmentation method for diffusion models that applies transformations to noisy data, helping to reduce overfitting and improve performance across various benchmarks.
Contribution
The paper proposes ScoreAug, a new augmentation framework that operates on noisy data and enforces equivariant learning, enhancing diffusion model training and performance.
Findings
ScoreAug significantly improves performance on CIFAR-10, FFHQ, AFHQv2, and ImageNet.
ScoreAug effectively reduces overfitting in data-limited regimes.
ScoreAug can be combined with traditional augmentations for further gains.
Abstract
Diffusion models have achieved remarkable success in generative modeling. However, this study confirms the existence of overfitting in diffusion model training, particularly in data-limited regimes. To address this challenge, we propose Score Augmentation (ScoreAug), a novel data augmentation framework specifically designed for diffusion models. Unlike conventional augmentation approaches that operate on clean data, ScoreAug applies transformations to noisy data, aligning with the inherent denoising mechanism of diffusion. Crucially, ScoreAug further requires the denoiser to predict the augmentation of the original target. This design establishes an equivariant learning objective, enabling the denoiser to learn scores across varied denoising spaces, thereby realizing what we term score augmentation. We also theoretically analyze the relationship between scores in different spaces under…
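The mechanism described in the abstract can be sketched as follows: transform the noisy input, ask the denoiser for the transformed clean target, and condition it on which augmentation was applied. This is a minimal NumPy sketch, not the authors' code. The transform set, the one-hot `omega` encoding, the name `scoreaug_loss`, and the `denoiser(x, sigma, omega)` signature are all hypothetical, and EDM's noise-level sampling and loss weighting are omitted.

```python
import numpy as np

# Hypothetical transform set: identity, horizontal flip, and 90/180/270-degree
# rotations of a 2-D (H, W) array. Each one only permutes pixels, so it is
# linear and orthogonal.
TRANSFORMS = [
    lambda z: z,
    lambda z: z[:, ::-1],
    lambda z: np.rot90(z, 1),
    lambda z: np.rot90(z, 2),
    lambda z: np.rot90(z, 3),
]


def scoreaug_loss(denoiser, y, sigma, rng):
    """Squared-error term for one ScoreAug-style training step (sketch)."""
    k = int(rng.integers(len(TRANSFORMS)))        # sample one transform
    T = TRANSFORMS[k]
    omega = np.eye(len(TRANSFORMS))[k]            # hypothetical one-hot condition
    x = y + sigma * rng.standard_normal(y.shape)  # noisy input, as in standard training
    x_aug = T(x)                                  # transform the *noisy* data
    target = T(y)                                 # equivariant target: transformed clean data
    pred = denoiser(x_aug, sigma, omega)          # denoiser also sees the augmentation label
    return float(np.mean((pred - target) ** 2))


# Usage with a placeholder "denoiser" that returns its input unchanged.
rng = np.random.default_rng(0)
y = rng.standard_normal((64, 64))
identity_denoiser = lambda x, sigma, omega: x
loss = scoreaug_loss(identity_denoiser, y, 0.5, rng)
# The placeholder leaves all of the noise in place, so loss is close to sigma**2 = 0.25.
```

Because each transform here only permutes pixels, $T(y + \sigma n) = T(y) + \sigma T(n)$ and $T(n)$ is again i.i.d. Gaussian, so the augmented input is a correctly distributed noisy sample of $T(y)$. Non-linear transforms, which the reviews indicate the paper also considers, do not have this property.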
Peer Reviews
Decision: ICLR 2026 Conference Withdrawn Submission
It is interesting to confirm that applying the transformation in the noisy space works as well as the original EDM training.
### W1. Marginal empirical gains.
- Table 2 already shows EDM w/ NLA ≈ 2.1 FID, while the proposed method achieves only ~0.05–0.1 improvement, well within run-to-run variance.
- The authors describe this as a “consistent performance improvement,” but such a small delta is unlikely to be statistically meaningful, particularly given the stochasticity of diffusion model training.
- Without repeated trials or confidence intervals, it is difficult to tell whether any real gain exists or if this refle
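W1's concern can be checked directly once FID is measured over several independently seeded training runs per method. Below is a minimal sketch of Welch's unequal-variance t statistic using only the Python standard library; the name `welch_t` is illustrative, and no FID values from the paper are implied. `scipy.stats.ttest_ind(a, b, equal_var=False)` performs the same test and also returns a p-value.

```python
import math
from statistics import mean, stdev


def welch_t(a, b):
    """Welch's t statistic and Welch-Satterthwaite degrees of freedom.

    `a` and `b` are per-seed metric values (e.g. FID) for two methods.
    Each needs at least two runs, and the groups cannot both have zero spread.
    """
    va = stdev(a) ** 2 / len(a)  # squared standard error of mean(a)
    vb = stdev(b) ** 2 / len(b)  # squared standard error of mean(b)
    t = (mean(a) - mean(b)) / math.sqrt(va + vb)
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df
```

A small |t| relative to the critical value for `df` degrees of freedom means the observed FID gap cannot be distinguished from seed-to-seed noise.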
- The paper aims to tackle an important and practical problem, providing an additional (and relatively understudied) perspective on improving the training of diffusion models.
- Figure 1 and Table 1 provide a clear and intuitive illustration of how ScoreAug works.
- The study presented in Figure 2 strongly supports the claim that ScoreAug effectively mitigates overfitting with limited data or excessive model capacity.
- The experimental setups described in Table 2 are very confusing, particularly the meaning of $+$ and $\times$.
- According to my understanding:
  - "EDM w/ NLA" = using a non-linear transformation pipeline (described in Table 6 of their paper), and conditioning the model with the 9-dim aug label vector $a$.
  - "ScoreAug(Linear)" = using one randomly selected linear transformation (Appx. B), and conditioning the model with the vector $\omega$ (Line 265).
  - "ScoreAug(type 1/2)" = using
1. Method–process alignment: Augmenting in the noisy space with an equivariant target is principled and avoids the mismatch of clean-only augmentation; it also mitigates augmentation leakage with proper conditioning.
2. A clear correspondence between scores under general transformations (Theorem 1) supports the design beyond linear cases.
3. Broad empirical coverage: Consistent improvements over EDM baselines (with/without non-leaky aug) and additional gains on SiT (ImageNet-256), indicating
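Strength 2 refers to a correspondence between scores in different spaces (Theorem 1), whose statement is not reproduced in this excerpt. As a hedged illustration, the sketch below checks only the textbook linear special case, not the paper's theorem: for an invertible matrix $A$ and $Y = AX$, $p_Y(y) = p_X(A^{-1}y)/|\det A|$, so $\nabla_y \log p_Y(y) = A^{-\top}\nabla_x \log p_X(A^{-1}y)$, because the determinant term does not depend on $y$. A Gaussian is used since both sides have closed forms; `gaussian_score` is an illustrative helper.

```python
import numpy as np


def gaussian_score(x, mu, cov):
    """Score (gradient of the log-density) of N(mu, cov) evaluated at x."""
    return -np.linalg.solve(cov, x - mu)


rng = np.random.default_rng(0)
d = 3
A = rng.standard_normal((d, d)) + d * np.eye(d)  # invertible linear transform
mu = rng.standard_normal(d)
L = rng.standard_normal((d, d))
cov = L @ L.T + np.eye(d)                        # positive-definite covariance

y = rng.standard_normal(d)                       # a point in the transformed space
# Direct score of Y = A X, which is distributed as N(A mu, A cov A^T).
direct = gaussian_score(y, A @ mu, A @ cov @ A.T)
# Change-of-variables identity: grad_y log p_Y(y) = A^{-T} grad_x log p_X(A^{-1} y).
via_identity = np.linalg.solve(A.T, gaussian_score(np.linalg.solve(A, y), mu, cov))
print(np.allclose(direct, via_identity))  # True
```

For an orthogonal $A$ (a flip or rotation), $A^{-\top} = A$, so the score transforms exactly like the data, consistent with the equivariant objective described in the abstract.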
1. Ablation depth: Type I vs. Type II under nonlinear transforms and the exact role/sensitivity of conditioning (and its injection method) deserve deeper, more granular analysis beyond a few tables.
2. Compute reporting: Claims that resources “do not increase significantly” are not quantified (e.g., wall-clock, GPU-hours) across datasets; clearer cost–benefit and sampling-time impacts would help adoption.
3. Metric/setting breadth: Heavy reliance on FID, with limited diversity metrics or hum
Taxonomy
Topics: Generative Adversarial Networks and Image Synthesis · Model Reduction and Neural Networks · Domain Adaptation and Few-Shot Learning
