How much is a noisy image worth? Data Scaling Laws for Ambient Diffusion
Giannis Daras, Yeshwanth Cherapanamjeri, Constantinos Daskalakis

TL;DR
This paper investigates how training diffusion models on noisy data impacts performance, demonstrating that combining a small amount of clean data with large noisy datasets can achieve near state-of-the-art results, supported by theoretical bounds.
Contribution
It provides the first large-scale empirical analysis of training diffusion models with noisy data and introduces novel theoretical bounds for learning from Gaussian mixtures with heterogeneous variances.
Findings
Pure noisy data cannot match clean data performance at large sample sizes.
Adding a small amount of clean data to noisy data achieves near state-of-the-art results.
Theoretical bounds show noisy samples have exponentially less utility than clean samples.
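The second finding can be made concrete with a minimal sketch of how a mixed training set might be assembled. It assumes the corruption is additive Gaussian noise with a single standard deviation `sigma`. The function name `make_mixed_dataset`, the 10% clean fraction, and the toy samples are hypothetical choices for illustration and are not taken from the paper.

```python
import random


def make_mixed_dataset(clean_samples, clean_fraction, sigma, seed=0):
    """Keep a fraction of samples clean and add Gaussian noise to the rest.

    Returns a list of (sample, noise_std) pairs; noise_std is 0.0 for the
    samples that were left clean. Assumption: additive Gaussian corruption.
    """
    if not 0.0 <= clean_fraction <= 1.0:
        raise ValueError("clean_fraction must be in [0, 1]")
    rng = random.Random(seed)
    n_clean = round(clean_fraction * len(clean_samples))

    # Pick which samples stay clean, uniformly at random.
    indices = list(range(len(clean_samples)))
    rng.shuffle(indices)
    clean_idx = set(indices[:n_clean])

    dataset = []
    for i, x in enumerate(clean_samples):
        if i in clean_idx:
            dataset.append((list(x), 0.0))
        else:
            # Corrupt every coordinate with independent N(0, sigma^2) noise.
            noisy = [v + sigma * rng.gauss(0.0, 1.0) for v in x]
            dataset.append((noisy, sigma))
    return dataset


# Toy usage: 10 four-dimensional samples, 10% kept clean (illustrative values).
samples = [[float(i)] * 4 for i in range(10)]
mixed = make_mixed_dataset(samples, clean_fraction=0.1, sigma=0.2)
```

Storing the noise level next to each sample reflects the general idea that a training procedure for corrupted data needs to know how noisy each sample is; the training objective itself is not sketched here.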
Abstract
The quality of generative models depends on the quality of the data they are trained on. Creating large-scale, high-quality datasets is often expensive and sometimes impossible, e.g. in certain scientific applications where there is no access to clean data due to physical or instrumentation constraints. Ambient Diffusion and related frameworks train diffusion models with solely corrupted data (which are usually cheaper to acquire) but ambient models significantly underperform models trained on clean data. We study this phenomenon at scale by training more than models on data with different corruption levels across three datasets ranging from to M samples. We show that it is impossible, at these sample sizes, to match the performance of models trained on clean data when only training on noisy data. Yet, a combination of a small set of clean data (e.g. of…
Code & Models
- 🤗 giannisdaras/ambient_laws_imagenet_sigma_0.2_corruption_0.9_keep_1.0 (model)
- 🤗 giannisdaras/ambient_laws_imagenet_sigma_0.2_corruption_0.1_keep_1.0 (model)
- 🤗 giannisdaras/ambient_laws_cifar_sigma_0.2_corruption_0.1_keep_1.0 (model, 2 downloads)
- 🤗 giannisdaras/ambient_laws_celeba_sigma_0.2_corruption_0.1_keep_1.0 (model, 1 download)
- 🤗 giannisdaras/ambient_laws_imagenet_sigma_0.2_corruption_0.3_keep_1.0 (model)
- 🤗 giannisdaras/ambient_laws_cifar_sigma_0.2_corruption_0.3_keep_1.0 (model)
- 🤗 giannisdaras/ambient_laws_celeba_sigma_0.2_corruption_0.3_keep_1.0 (model)
- 🤗 giannisdaras/ambient_laws_imagenet_sigma_0.2_corruption_0.5_keep_1.0 (model)
- 🤗 giannisdaras/ambient_laws_cifar_sigma_0.2_corruption_0.5_keep_1.0 (model, 1 download)
- 🤗 giannisdaras/ambient_laws_celeba_sigma_0.2_corruption_0.5_keep_1.0 (model)
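The repository names in this list follow a regular pattern, so a small helper can split them into their fields for filtering or bookkeeping. This is a sketch only: the helper names `parse_repo_id` and `download` are hypothetical, the meaning of the `sigma`, `corruption`, and `keep` fields is not stated on this page, and the download step assumes the `huggingface_hub` package is installed. Which files each repository contains is also not stated here.

```python
import re

# Pattern inferred from the repository names listed on this page.
_REPO_PATTERN = re.compile(
    r"^(?P<owner>[^/]+)/ambient_laws_(?P<dataset>[a-z0-9]+)"
    r"_sigma_(?P<sigma>\d+(?:\.\d+)?)"
    r"_corruption_(?P<corruption>\d+(?:\.\d+)?)"
    r"_keep_(?P<keep>\d+(?:\.\d+)?)$"
)


def parse_repo_id(repo_id):
    """Split a repository id into owner, dataset, and numeric name fields."""
    match = _REPO_PATTERN.match(repo_id)
    if match is None:
        raise ValueError(f"unrecognized repository id: {repo_id!r}")
    fields = match.groupdict()
    for key in ("sigma", "corruption", "keep"):
        fields[key] = float(fields[key])
    return fields


def download(repo_id):
    """Fetch all files of a repository; returns the local directory path."""
    # Imported lazily so the parser above works without the package installed.
    from huggingface_hub import snapshot_download

    return snapshot_download(repo_id=repo_id)


info = parse_repo_id(
    "giannisdaras/ambient_laws_cifar_sigma_0.2_corruption_0.5_keep_1.0"
)
```

Calling `download(...)` requires network access and is left out of the usage line above for that reason.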
Taxonomy
Topics: Image and Video Quality Assessment
Methods: Diffusion · Sparse Evolutionary Training
