Diffusion Models Need Visual Priors for Image Generation
Xiaoyu Yue, Zidong Wang, Zeyu Lu, Shuyang Sun, Meng Wei, Wanli Ouyang, Lei Bai, Luping Zhou

TL;DR
This paper introduces Diffusion on Diffusion (DoD), a multi-stage framework that enhances image generation by incorporating visual priors extracted from previous samples, significantly improving quality and reducing training costs.
Contribution
The paper proposes a novel multi-stage diffusion framework that leverages visual priors via a latent embedding module, leading to superior image quality with less training.
Findings
DoD reduces training cost by 7× compared to SiT and DiT.
DoD-XL achieves an FID-50K score of 1.83 with only 1 million steps.
DoD outperforms state-of-the-art methods in image quality without additional inference complexity.
Abstract
Conventional class-guided diffusion models generally succeed in generating images with correct semantic content, but often struggle with texture details. This limitation stems from the usage of class priors, which only provide coarse and limited conditional information. To address this issue, we propose Diffusion on Diffusion (DoD), an innovative multi-stage generation framework that first extracts visual priors from previously generated samples, then provides rich guidance for the diffusion model leveraging visual priors from the early stages of diffusion sampling. Specifically, we introduce a latent embedding module that employs a compression-reconstruction approach to discard redundant detail information from the conditional samples in each stage, retaining only the semantic information for guidance. We evaluate DoD on the popular ImageNet- dataset, reducing 7…
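The multi-stage loop described in the abstract can be sketched roughly as follows. This is a toy illustration under stated assumptions, not the authors' implementation: `embed_prior`, `denoise_step`, `sample_stage`, and `dod_sample` are hypothetical names, average pooling stands in for the learned compression-reconstruction module, a simple pull toward a target stands in for the diffusion model, and the first stage is assumed to run with the class label only.

```python
import random

def embed_prior(sample, dim=4):
    """Toy stand-in for the latent embedding module (hypothetical).

    Average-pools the sample down to `dim` values, which discards
    fine detail and keeps only a coarse summary for guidance.
    """
    chunk = len(sample) // dim
    return [sum(sample[i * chunk:(i + 1) * chunk]) / chunk for i in range(dim)]

def denoise_step(x, class_label, prior, rate=0.2):
    """Toy denoiser (hypothetical): move x a fraction `rate` toward a target.

    The target depends on the class label and, when a visual prior is
    available, on the prior broadcast back to full resolution.
    """
    n = len(x)
    target = [float(class_label)] * n
    if prior is not None:
        chunk = n // len(prior)
        target = [0.5 * target[i] + 0.5 * prior[i // chunk] for i in range(n)]
    return [xi + rate * (ti - xi) for xi, ti in zip(x, target)]

def sample_stage(class_label, prior, size, steps, rng):
    """Run one full sampling pass starting from Gaussian noise."""
    x = [rng.gauss(0.0, 1.0) for _ in range(size)]
    for _ in range(steps):
        x = denoise_step(x, class_label, prior)
    return x

def dod_sample(class_label, num_stages=2, size=16, steps=10, seed=0):
    """Multi-stage sampling: each stage is guided by the previous stage's output.

    Stage 1 sees only the class label (assumption); later stages also
    receive the compressed embedding of the previous sample.
    """
    rng = random.Random(seed)
    prior = None
    outputs = []
    for _ in range(num_stages):
        x = sample_stage(class_label, prior, size, steps, rng)
        outputs.append(x)
        prior = embed_prior(x)  # becomes the visual prior for the next stage
    return outputs

if __name__ == "__main__":
    stages = dod_sample(class_label=3, num_stages=3)
    print(len(stages), "stages,", len(stages[0]), "values each")
```

Because the same `denoise_step` is reused in every stage, the sketch mirrors the shared-backbone idea noted by the reviewers: adding stages adds sampling compute but no new parameters.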
Peer Reviews
Decision: ICLR 2025 Conference Withdrawn Submission
Experiments demonstrate that running the trained network in a multi-stage fashion yields visual quality improvements in the generated images. The image output by each stage appears to refine the details of the image produced by the previous stage, indicating that the latent embedding acts as a useful guide to the generation process. Quality improvements (as measured by FID) are achieved even when comparing to baselines at comparable inference cost (in FLOPs), as shown in Table 4.
Figure 4 suggests that beyond a two-stage system (one unconditional and one conditional generation pass), there is minimal benefit to subsequent conditional generation stages; stage 2 and stage 3 FID scores seem nearly identical. Linear probing scores (Table 2) suggest that the LEM is learning only a limited semantic representation. For example, a pre-trained contrastive encoder would score much better on ImageNet linear probing. I am concerned about the novelty of the contribution. The over
Overall, the image quality in the paper seems fine with good details. Overall presentation of the paper is clear.
The method includes multi-stage sampling, which makes the already slow sampling process even more time-consuming and computationally inefficient. However, the experimental results and design that demonstrate your method's computational efficiency are somewhat confusing and unconvincing to me. I have addressed some of these points in the next section.
(1) The proposed method enhances image generation quality compared to the baseline. (2) The authors conduct a comprehensive experimental analysis of model configurations for conditional image generation tasks, including visualizations for qualitative comparisons. (3) The multi-stage generative model utilizes a shared backbone, efficiently reducing the total parameter count as additional stages are added.
(1) As shown in Figure 5, the images generated across multiple stages appear nearly identical, making it difficult to visually assess the advantages of the multi-stage system. (2) Although the multi-stage system incorporates an efficient shared-parameter design, its FLOPs increase with each additional stage. Performance seems to plateau after two stages, raising questions about the scalability of adding more stages.
1) From a theoretical level, I find the paper very nice. It is a) simple / elegant and b) fast. From a novelty perspective, I have no concerns. 2) The paper includes multiple strong results concerning the practical usefulness of the method (num parameters, steps, GFLOPS). 3) This paper is well written and well presented.
Points are in order of my perceived importance (most to least), indicating how heavily they weigh in my rating. 1) Table 5 (main results) seems to primarily compare against basic diffusion models, but misses comparisons against other post-hoc diffusion methods (e.g. MDTv2: Masked Diffusion Transformer is a Strong Image Synthesizer, Gao 2023 already cited in this paper; ReNO: Enhancing One-step Text-to-Image Models Through Reward-based Noise Optimization, Eyring 2024; ElasticDiffusion: Training-fre
Taxonomy
Topics: Computer Graphics and Visualization Techniques
Methods: Diffusion
