Stochastic Forward-Backward Deconvolution: Training Diffusion Models with Finite Noisy Datasets
Haoye Lu, Qifan Wu, Yaoliang Yu

TL;DR
This paper introduces Stochastic Forward-Backward Deconvolution (SFBD), a method for training diffusion models on noisy datasets, and shows that pretraining on a small fraction of clean data substantially improves learning and yields high-quality image generation.
Contribution
The paper proposes the Stochastic Forward-Backward Deconvolution (SFBD) technique and shows that pretraining with a small amount of clean data enables effective learning from noisy datasets.
Findings
FID 6.31 on CIFAR-10 with 4% clean images (3.58 with 10%)
Theoretical guarantees for SFBD learning the true data distribution
Pretraining on limited clean data enhances diffusion model performance
Abstract
Recent diffusion-based generative models achieve remarkable results by training on massive datasets, yet this practice raises concerns about memorization and copyright infringement. A proposed remedy is to train exclusively on noisy data with potential copyright issues, ensuring the model never observes original content. However, through the lens of deconvolution theory, we show that although it is theoretically feasible to learn the data distribution from noisy samples, the practical challenge of collecting sufficient samples makes successful learning nearly unattainable. To overcome this limitation, we propose to pretrain the model with a small fraction of clean data to guide the deconvolution process. Combined with our Stochastic Forward-Backward Deconvolution (SFBD) method, we attain FID 6.31 on CIFAR-10 with just 4% clean images (and 3.58 with 10%). We also provide theoretical…
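The deconvolution argument in the abstract can be illustrated with a toy sketch. This is not the SFBD method. It assumes a one-dimensional setting with additive Gaussian noise of known standard deviation `sigma` (the excerpt above does not specify the noise model), and the function names are hypothetical. Under that assumption, the characteristic function of the noisy data equals that of the clean data multiplied by exp(-sigma^2 t^2 / 2), so dividing out this factor recovers the clean one in principle. The same division multiplies sampling error by exp(sigma^2 t^2 / 2), which gives one intuition for why a finite noisy dataset may be insufficient in practice.

```python
import cmath
import math
import random


def empirical_cf(samples, t):
    """Empirical characteristic function: the mean of exp(i * t * y)."""
    return sum(cmath.exp(1j * t * y) for y in samples) / len(samples)


def deconvolved_cf(noisy_samples, t, sigma):
    """Estimate the clean characteristic function by dividing out the
    Gaussian factor exp(-sigma^2 t^2 / 2) (assumed noise model)."""
    return empirical_cf(noisy_samples, t) * math.exp(0.5 * (sigma * t) ** 2)


rng = random.Random(0)
sigma = 1.0   # assumed, known noise level
n = 20000     # number of noisy samples

# Toy clean data: +1 or -1 with equal probability, whose true
# characteristic function is cos(t).
clean = [rng.choice((-1.0, 1.0)) for _ in range(n)]
noisy = [x + rng.gauss(0.0, sigma) for x in clean]

for t in (0.5, 1.0, 4.0):
    estimate = deconvolved_cf(noisy, t, sigma).real
    amplification = math.exp(0.5 * (sigma * t) ** 2)
    print(f"t={t}: estimate={estimate:.3f}, "
          f"true={math.cos(t):.3f}, error amplification={amplification:.1f}")
```

At low frequencies the estimate tracks cos(t) closely, while at higher frequencies the amplification factor grows so quickly that the same sample size no longer constrains the estimate.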
Taxonomy
Topics: Gaussian Processes and Bayesian Inference
