Stochastic Forward-Backward Deconvolution: Training Diffusion Models with Finite Noisy Datasets

Haoye Lu; Qifan Wu; Yaoliang Yu

arXiv:2502.05446·cs.LG·June 4, 2025

TL;DR

This paper introduces Stochastic Forward-Backward Deconvolution (SFBD), a method for training diffusion models on noisy datasets, and shows that pretraining on a small fraction of clean data makes learning from the noisy remainder practical and yields high-quality image generation.

Contribution

The paper proposes the Stochastic Forward-Backward Deconvolution (SFBD) technique and shows that pretraining with a small amount of clean data enables effective learning from noisy datasets.

Findings

01. Achieved FID 6.31 on CIFAR-10 with 4% clean images

02. Theoretical guarantees for SFBD learning the true data distribution

03. Pretraining on limited clean data enhances diffusion model performance

Abstract

Recent diffusion-based generative models achieve remarkable results by training on massive datasets, yet this practice raises concerns about memorization and copyright infringement. A proposed remedy is to train exclusively on noisy data with potential copyright issues, ensuring the model never observes original content. However, through the lens of deconvolution theory, we show that although it is theoretically feasible to learn the data distribution from noisy samples, the practical challenge of collecting sufficient samples makes successful learning nearly unattainable. To overcome this limitation, we propose to pretrain the model with a small fraction of clean data to guide the deconvolution process. Combined with our Stochastic Forward–Backward Deconvolution (SFBD) method, we attain FID 6.31 on CIFAR-10 with just 4% clean images (and 3.58 with 10%). We also provide theoretical…
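
To make the deconvolution framing concrete, here is a minimal toy sketch. It is not the SFBD algorithm; it only illustrates why noisy samples still carry information about the clean distribution. It assumes additive Gaussian noise with a known standard deviation `sigma`, which the abstract does not specify, and the function names `corrupt` and `deconvolved_moments` are hypothetical.

```python
import numpy as np


def corrupt(x, sigma, rng):
    """Assumed corruption model: y = x + sigma * eps, with eps ~ N(0, 1)."""
    return x + sigma * rng.standard_normal(x.shape)


def deconvolved_moments(y, sigma):
    """Estimate the clean mean and variance from noisy samples only.

    With eps independent of x, E[y] = E[x] and Var[y] = Var[x] + sigma**2,
    so subtracting sigma**2 undoes the effect of the noise on the variance.
    """
    return y.mean(), y.var() - sigma**2


rng = np.random.default_rng(0)
x = rng.choice([-2.0, 2.0], size=200_000)  # toy clean data: mean 0, variance 4
y = corrupt(x, sigma=1.0, rng=rng)         # only y is used for estimation
mean_hat, var_hat = deconvolved_moments(y, sigma=1.0)
print(f"mean ~ {mean_hat:.3f}, variance ~ {var_hat:.3f}")
```

Two moments are easy to recover this way. Recovering a full high-dimensional image distribution from noisy samples is far more demanding in sample count, which is the practical obstacle the abstract describes and the motivation for pretraining on a small clean subset.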

Videos

Stochastic Forward–Backward Deconvolution: Training Diffusion Models with Finite Noisy Datasets (SlidesLive)

Taxonomy

Topics: Gaussian Processes and Bayesian Inference