Ambient Dataloops: Generative Models for Dataset Refinement
Adrián Rodríguez-Muñoz, William Daspit, Adam Klivans, Antonio Torralba, Constantinos Daskalakis, Giannis Daras

TL;DR
Ambient Dataloops introduces an iterative dataset refinement framework using diffusion models, enhancing data quality and model performance in image generation and protein design through a co-evolution process.
Contribution
The paper presents a novel co-evolution framework that iteratively refines datasets and models using Ambient Diffusion, improving generative performance on complex tasks.
Findings
Achieves state-of-the-art results in image generation
Improves de novo protein design quality
Provides theoretical justification for the data looping process
Abstract
We propose Ambient Dataloops, an iterative framework for refining datasets that makes it easier for diffusion models to learn the underlying data distribution. Modern datasets contain samples of highly varying quality, and training directly on such heterogeneous data often yields suboptimal models. We propose a dataset-model co-evolution process: at each iteration of our method, the dataset becomes progressively higher quality, and the model improves accordingly. To avoid destructive self-consuming loops, at each generation we treat the synthetically improved samples as noisy, but at a slightly lower noise level than in the previous iteration, and we use Ambient Diffusion techniques for learning under corruption. Empirically, Ambient Dataloops achieve state-of-the-art performance in unconditional and text-conditional image generation and de novo protein design. We further provide a…
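The loop described in the abstract can be illustrated with a deliberately tiny analogue. The sketch below is not the paper's method: it swaps the diffusion model for a one-dimensional Gaussian "model", for which both the corruption-aware fit and the partial denoising step have closed forms. All names (`fit_gaussian_ambient`, `refine`, `dataloop`, `shrink`) are hypothetical and chosen only for illustration. It assumes each sample is observed as `x0 + sigma * eps` with a known per-sample noise level `sigma`.

```python
import numpy as np

def fit_gaussian_ambient(x, sigma):
    """Fit the clean mean/variance from samples x_i = x0_i + sigma_i * eps_i.

    Toy stand-in for training under corruption: the known noise variance
    is subtracted from the observed variance instead of being ignored.
    """
    mean = x.mean()
    var = max(x.var() - np.mean(sigma**2), 1e-6)
    return mean, var

def refine(x, sigma, mean, var, shrink, rng):
    """Move each sample from noise level sigma to shrink * sigma.

    Samples from the Gaussian posterior of the less-noisy variable given
    the noisier one, so the output is still treated as noisy, only less so.
    """
    new_sigma = shrink * sigma
    gain = (var + new_sigma**2) / (var + sigma**2)
    post_mean = mean + gain * (x - mean)
    post_var = gain * (sigma**2 - new_sigma**2)
    noise = rng.standard_normal(x.shape)
    return post_mean + np.sqrt(post_var) * noise, new_sigma

def dataloop(x, sigma, n_loops=3, shrink=0.7, seed=0):
    """Alternate model fitting and dataset refinement for n_loops rounds."""
    rng = np.random.default_rng(seed)
    for _ in range(n_loops):
        mean, var = fit_gaussian_ambient(x, sigma)
        x, sigma = refine(x, sigma, mean, var, shrink, rng)
    return x, sigma, fit_gaussian_ambient(x, sigma)

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    clean = 2.0 + rng.standard_normal(5000)
    sigma0 = rng.choice([0.0, 0.5, 2.0], size=5000)  # mixed-quality data
    noisy = clean + sigma0 * rng.standard_normal(5000)
    _, sigma, (mean, var) = dataloop(noisy, sigma0)
    print("max noise level after looping:", sigma.max())
    print("estimated clean mean and variance:", mean, var)
```

The point the toy shares with the abstract is that refined samples are never relabelled as clean: each round only lowers their assumed noise level by the factor `shrink`, and the next fit still accounts for the remaining noise.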
Peer Reviews
Decision: Submitted to ICLR 2026
- The approach is intuitive and practically motivated, addressing a real challenge in diffusion training when high-quality data is scarce.
- The paper provides theoretical justification explaining why introducing controlled corruption can improve model robustness and training dynamics.
- The empirical evaluation is broad and well-structured, spanning three directions and including ablation studies to demonstrate the contribution of each component.
- The real-world evaluation in Section 5.2 is limited to MicroDiffusion. It remains unclear how Ambient Dataloops scale to larger, widely used diffusion models (+ finetuning). For practical adoption, evidence on larger models would be important.
- There is some empirical evidence that one loop is helpful.
- There are a good number of experiments.
- Several important experimental details are missing (training time, training hyperparameters, diffusion sampling).
- No confidence intervals or statistical tests are presented in the results tables, so no conclusions about model performance can be drawn.
- Metrics are also under-described. For example, how many samples were used to compute FIDs? These details should be included in the appendix.
- The empirical benefits and contributions are mostly due to ambient diffusion; the looping does not s
1. The paper leverages generative models to refine their own training data.
2. The method is technically reasonable and well motivated.
1. The method is rooted mainly in Ambient Diffusion; the difference from it needs to be justified.
2. The CIFAR experiment is too small to be meaningful (see Table 1), since the gain is quite marginal and the resolution is very low (32x32). A larger-scale experiment on ImageNet is needed.
3. "Our framework trains to extract as much utility as possible from a given training set; if there is more data available, it is always better to perform training updates on it as fresh samples reveal more about the u
Taxonomy
Topics: Generative Adversarial Networks and Image Synthesis · Cell Image Analysis Techniques · Domain Adaptation and Few-Shot Learning
