Slight Corruption in Pre-training Data Makes Better Diffusion Models
Hao Chen, Yujin Han, Diganta Misra, Xiang Li, Kai Hu, Difan Zou, Masashi Sugiyama, Jindong Wang, Bhiksha Raj

TL;DR
This study shows that slight corruption in pre-training data can unexpectedly improve the quality and diversity of diffusion models' generated outputs, supported by empirical and theoretical analysis.
Contribution
This is the first comprehensive analysis of how slight corruption in pre-training data can improve diffusion models' performance. It also introduces a simple method, condition embedding perturbation (CEP), to exploit this effect.
Findings
Slight corruption improves image quality, diversity, and fidelity.
Theoretical analysis shows that slight corruption increases the entropy of the generated distribution and reduces its Wasserstein distance to the ground-truth data distribution.
Condition embedding perturbations (CEP) enhance model performance.
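A minimal sketch of what perturbing a condition embedding could look like, assuming the embedding is a plain list of floats and that i.i.d. Gaussian noise is added at training time. The function name `perturb_condition_embedding`, the parameter `gamma`, and the noise scale `gamma / sqrt(d)` are illustrative assumptions, not details confirmed by this summary.

```python
import math
import random


def perturb_condition_embedding(embedding, gamma=1.0, rng=None):
    """Return a copy of `embedding` with i.i.d. Gaussian noise added.

    Assumption: the noise standard deviation is gamma / sqrt(d), where d is
    the embedding dimension. This scale is chosen for illustration only.
    """
    rng = rng or random.Random()
    d = len(embedding)
    sigma = gamma / math.sqrt(d)
    # Build a new list so the caller's clean embedding is left untouched.
    return [x + rng.gauss(0.0, sigma) for x in embedding]


# Usage: perturb a toy 8-dimensional embedding with a fixed seed.
rng = random.Random(0)
clean = [0.0] * 8
noisy = perturb_condition_embedding(clean, gamma=0.1, rng=rng)
```

In a training loop, a perturbation like this would be applied to the condition embedding only, leaving the image (or other data) input unchanged; setting `gamma` to 0 recovers the unperturbed embedding.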
Abstract
Diffusion models (DMs) have shown remarkable capabilities in generating realistic high-quality images, audios, and videos. They benefit significantly from extensive pre-training on large-scale datasets, including web-crawled data with paired data and conditions, such as image-text and image-class pairs. Despite rigorous filtering, these pre-training datasets often inevitably contain corrupted pairs where conditions do not accurately describe the data. This paper presents the first comprehensive study on the impact of such corruption in pre-training data of DMs. We synthetically corrupt ImageNet-1K and CC3M to pre-train and evaluate over 50 conditional DMs. Our empirical findings reveal that various types of slight corruption in pre-training can significantly enhance the quality, diversity, and fidelity of the generated images across different DMs, both during pre-training and downstream…
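For class-conditional data such as image-class pairs, synthetic corruption of the conditions could be sketched as below. This assumes corruption means replacing a fixed fraction of labels with a different class drawn uniformly at random; the function name `corrupt_labels` and its parameters are hypothetical, and the procedure the authors actually used is not described in this summary.

```python
import random


def corrupt_labels(labels, num_classes, corruption_ratio, seed=0):
    """Return a copy of `labels` with a fraction flipped to a different class.

    Assumption: each selected label is replaced by a class drawn uniformly
    from the other num_classes - 1 classes.
    """
    rng = random.Random(seed)
    n = len(labels)
    num_corrupt = round(corruption_ratio * n)
    corrupted = list(labels)
    # Pick distinct positions to corrupt.
    for i in rng.sample(range(n), num_corrupt):
        new_label = rng.randrange(num_classes - 1)
        # Shift past the original label so the new one always differs.
        if new_label >= corrupted[i]:
            new_label += 1
        corrupted[i] = new_label
    return corrupted


# Usage: corrupt 5% of 1000 labels over 10 classes.
labels = [i % 10 for i in range(1000)]
corrupted = corrupt_labels(labels, num_classes=10, corruption_ratio=0.05)
num_changed = sum(a != b for a, b in zip(labels, corrupted))
```

Because every selected label is forced to change, `num_changed` equals the requested number of corrupted pairs, which makes the corruption ratio exact rather than approximate.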
Taxonomy
Topics: Statistical Methods and Inference
