Masked Diffusion Models Are Fast Distribution Learners
Jiachen Lei, Qinglong Wang, Peng Cheng, Zhongjie Ba, Zhan Qin, Zhibo Wang, Zhenguang Liu, Kui Ren

TL;DR
This paper introduces a two-stage training framework for diffusion models that significantly accelerates training and improves performance by pre-training on masked images to learn a primer distribution, then fine-tuning for specific tasks.
Contribution
The authors propose a masked pre-training method for diffusion models that reduces training time and enhances generalization across datasets.
Findings
Achieved a new FID score of 6.27 on CelebA-HQ 256x256.
Pre-trained models show 46% quality improvement when fine-tuned on different datasets.
Significant acceleration of the training process compared to traditional diffusion models.
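For readers unfamiliar with the metric, the FID (Fréchet Inception Distance) quoted above is commonly defined as follows. The symbols are notation introduced here for illustration, not taken from the paper: $(\mu_r, \Sigma_r)$ and $(\mu_g, \Sigma_g)$ denote the mean and covariance of Inception features computed on real and generated images, respectively. Lower values indicate that the two feature distributions are closer.

```latex
\mathrm{FID} = \lVert \mu_r - \mu_g \rVert_2^2
  + \operatorname{Tr}\!\left( \Sigma_r + \Sigma_g - 2\,(\Sigma_r \Sigma_g)^{1/2} \right)
```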
Abstract
Diffusion model has emerged as the de-facto model for image generation, yet the heavy training overhead hinders its broader adoption in the research community. We observe that diffusion models are commonly trained to learn all fine-grained visual information from scratch. This paradigm may cause unnecessary training costs hence requiring in-depth investigation. In this work, we show that it suffices to train a strong diffusion model by first pre-training the model to learn some primer distribution that loosely characterizes the unknown real image distribution. Then the pre-trained model can be fine-tuned for various generation tasks efficiently. In the pre-training stage, we propose to mask a high proportion (e.g., up to 90%) of input images to approximately represent the primer distribution and introduce a masked denoising score matching objective to train a model to denoise…
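The abstract is cut off before the objective is spelled out, so the following is only a minimal sketch of what masked pre-training of this kind might look like, not the authors' implementation. It assumes ViT-style patch tokens, a standard noise-prediction (epsilon) parameterization, and a loss computed only on the patches that remain visible. All names (`patchify`, `random_keep_indices`, `masked_denoising_loss`, `model`, `alpha_bar_t`) are hypothetical.

```python
import numpy as np

def patchify(images, patch):
    """Split (B, H, W, C) images into (B, N, patch*patch*C) flattened patches."""
    b, h, w, c = images.shape
    gh, gw = h // patch, w // patch
    x = images.reshape(b, gh, patch, gw, patch, c)
    x = x.transpose(0, 1, 3, 2, 4, 5)  # -> (B, gh, gw, patch, patch, C)
    return x.reshape(b, gh * gw, patch * patch * c)

def random_keep_indices(batch, num_patches, mask_ratio, rng):
    """Choose, per image, which patch indices stay visible (the rest are masked)."""
    num_keep = max(1, int(round(num_patches * (1.0 - mask_ratio))))
    scores = rng.random((batch, num_patches))
    keep = np.argsort(scores, axis=1)[:, :num_keep]
    return np.sort(keep, axis=1)

def masked_denoising_loss(model, images, alpha_bar_t, patch=4, mask_ratio=0.9, rng=None):
    """Noise-prediction MSE on visible patches only (assumed form of the objective).

    `model(noisy_tokens, keep_idx)` is a hypothetical callable that returns a
    noise estimate with the same shape as `noisy_tokens`.
    """
    rng = np.random.default_rng() if rng is None else rng
    tokens = patchify(images, patch)
    b, n, _ = tokens.shape
    keep = random_keep_indices(b, n, mask_ratio, rng)
    # Gather only the visible patches; masked patches never reach the model.
    visible = np.take_along_axis(tokens, keep[:, :, None], axis=1)
    noise = rng.standard_normal(visible.shape)
    # Standard forward diffusion step at a noise level given by alpha_bar_t.
    noisy = np.sqrt(alpha_bar_t) * visible + np.sqrt(1.0 - alpha_bar_t) * noise
    pred = model(noisy, keep)
    return float(np.mean((pred - noise) ** 2))

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    images = rng.standard_normal((2, 32, 32, 3))
    dummy_model = lambda x, idx: np.zeros_like(x)  # stand-in for a real network
    print(masked_denoising_loss(dummy_model, images, alpha_bar_t=0.5, rng=rng))
```

With a 90% mask ratio the model sees only about a tenth of the tokens per image, which is where a per-step cost reduction would come from for a Transformer backbone. In this sketch, fine-tuning would correspond to calling the same loss with `mask_ratio` lowered; whether the paper does exactly that is not stated in the text above.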
Peer Reviews
Decision: ICLR 2024 Conference Withdrawn Submission
1. Proposes a simple-to-implement pre-training technique for improving diffusion model convergence and accuracy. 2. Shows results on a known dataset and compares against multiple other published models. 3. In the theory section, the authors try to formulate why this works.
1. The experiments do not detail the time advantage of pre-training. 2. The experiments give no detail on the amount of fine-tuning, nor ablations on sensitivity to it. 3. Overall, the masking seems very hyperparameter-sensitive, as detailed in 4.2 and further in the Appendix.
- This work delves into a critical aspect of diffusion models, focusing on the enhancement of their training efficiency. - The idea proposed in this work is notably straightforward, and the results it yields are indeed promising. - The writing is generally clear and largely accessible. - The inclusion of a comprehensive ablation study is a strong point, and the validation of design choices adds credibility to the approach.
**Inconsistent/missing baselines**: The paper presents varying baseline models in Tables 2-4, all of which are generative models that can be conceptually reused in these experiments. Alternating between baselines in different experiments can be perplexing and lead to potential misinterpretation. **Vanilla U-ViT**: It is important to report the results of a vanilla U-ViT architecture without the masked pre-training, as this would help discern whether the performance improvement stems from the pr
1. To improve the training efficiency of diffusion models is an important problem. 2. The experimental results on CelebA-HQ show that the method performs better than the baseline.
1. The paper only shows results on CelebA and CelebA-HQ, which is not sufficient. More results on other datasets need to be presented to further demonstrate the effectiveness of the proposed method. 2. Results on CelebA 64x64 and 128x128 in Figure 4 (a) and (b) do not show that the proposed method has significant advantages over the baseline. 3. In Table 4, the results for the baseline, U-ViT, are missing.
1. The authors provide a thorough analysis with various ablation studies. 2. The problem of improving the distribution learning in the pretraining step of the diffusion models is interesting.
1. There are ambiguities in the manuscript. E.g. what is the point of learning a swiss roll distribution? This is mentioned as an example of learning the distribution, but it is not clear how it relates to the image generation task. Also, the paper's main claim in the title is on learning distributions, but there are no experiments supporting this claim. The paper only provides experiments on image generation quality based on FID and qualitative comparison. The distribution learning claim needs
1. The proposed Masked Diffusion Models (MaskDM) to reduce the training overhead of diffusion models is intuitive, and the analysis from the perspective of primer distribution is interesting. 2. The proposed two-stage training framework allows MaskDM to generalize to a new dataset with only a small amount of data for fine-tuning. 3. The paper is well-written and easy to follow.
1. The authors state that “our masked pre-training technique can be universally applied to various diffusion models that directly generate images in the pixel space, aiding in the learning of pre-trained models with superior generalizability.” The proposed Masked Diffusion Models are well suited to using ViT as the backbone. However, the current mainstream diffusion models use CNNs as the backbone. There is still insufficient evidence to determine whether Transformers contribute to the performance
Taxonomy
Topics: Generative Adversarial Networks and Image Synthesis · Domain Adaptation and Few-Shot Learning · Multimodal Machine Learning Applications
Methods: Denoising Score Matching · Diffusion
