How much is a noisy image worth? Data Scaling Laws for Ambient Diffusion

Giannis Daras; Yeshwanth Cherapanamjeri; Constantinos Daskalakis

arXiv:2411.02780·cs.LG·November 6, 2024

TL;DR

This paper investigates how training diffusion models on noisy data affects performance. It demonstrates that combining a small amount of clean data with a large noisy dataset can achieve near state-of-the-art results, and it supports this finding with theoretical bounds.

Contribution

It provides the first large-scale empirical analysis of training diffusion models with noisy data and introduces novel theoretical bounds for learning from Gaussian mixtures with heterogeneous variances.

Findings

01

Pure noisy data cannot match clean data performance at large sample sizes.

02

Adding a small amount of clean data to noisy data achieves near state-of-the-art results.

03

Theoretical bounds show noisy samples have exponentially less utility than clean samples.

Abstract

The quality of generative models depends on the quality of the data they are trained on. Creating large-scale, high-quality datasets is often expensive and sometimes impossible, e.g., in certain scientific applications where there is no access to clean data due to physical or instrumentation constraints. Ambient Diffusion and related frameworks train diffusion models solely on corrupted data (which is usually cheaper to acquire), but ambient models significantly underperform models trained on clean data. We study this phenomenon at scale by training more than $80$ models on data with different corruption levels across three datasets ranging from $30{,}000$ to $\approx 1.3$M samples. We show that it is impossible, at these sample sizes, to match the performance of models trained on clean data when only training on noisy data. Yet, a combination of a small set of clean data (e.g., $10\%$ of…
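To make the mixed-data setting concrete, the following is a minimal sketch of how a dataset with a small clean subset and a large noisy remainder could be assembled. It is not taken from the paper's repository: the function name `make_mixed_dataset`, the additive Gaussian corruption model, and the default values of `clean_fraction` and `sigma` are assumptions chosen for illustration (the 10% default only mirrors the example fraction mentioned in the abstract).

```python
import numpy as np

def make_mixed_dataset(images, clean_fraction=0.1, sigma=0.2, seed=0):
    """Hypothetical helper: keep a random `clean_fraction` of samples clean
    and add i.i.d. Gaussian noise with std `sigma` to all the others.

    Returns the mixed data and a per-sample array of noise levels
    (0.0 for clean samples, `sigma` for noisy ones).
    """
    rng = np.random.default_rng(seed)
    images = np.asarray(images, dtype=np.float64)
    n = images.shape[0]

    # Choose which samples stay clean.
    n_clean = int(round(clean_fraction * n))
    clean_idx = rng.permutation(n)[:n_clean]

    # Per-sample noise level: 0 for clean samples, sigma otherwise.
    sigmas = np.full(n, sigma, dtype=np.float64)
    sigmas[clean_idx] = 0.0

    # Broadcast each sample's noise level over its remaining dimensions.
    scale = sigmas.reshape((n,) + (1,) * (images.ndim - 1))
    data = images + scale * rng.standard_normal(images.shape)
    return data, sigmas

# Usage: 1000 placeholder "images" of size 8x8, 10% kept clean.
images = np.zeros((1000, 8, 8))
data, sigmas = make_mixed_dataset(images, clean_fraction=0.1, sigma=0.2)
```

Returning the per-sample noise level alongside the data is a design choice of this sketch: a training loop that treats clean and noisy samples differently would need to know which is which.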

Code & Models

Repositories

giannisdaras/ambient-laws
PyTorch · Official

Taxonomy

Topics: Image and Video Quality Assessment

Methods: Diffusion · Sparse Evolutionary Training