Diffusion Models Need Visual Priors for Image Generation

Xiaoyu Yue; Zidong Wang; Zeyu Lu; Shuyang Sun; Meng Wei; Wanli Ouyang; Lei Bai; Luping Zhou

arXiv:2410.08531·cs.CV·October 14, 2024

TL;DR

This paper introduces Diffusion on Diffusion (DoD), a multi-stage framework that enhances image generation by incorporating visual priors extracted from previous samples, significantly improving quality and reducing training costs.

Contribution

The paper proposes a novel multi-stage diffusion framework that leverages visual priors via a latent embedding module, leading to superior image quality with less training.

Findings

01

DoD reduces training cost by 7× compared to SiT and DiT.

02

DoD-XL achieves an FID-50K score of 1.83 with only 1 million steps.

03

DoD outperforms state-of-the-art methods in image quality without additional inference complexity.

Abstract

Conventional class-guided diffusion models generally succeed in generating images with correct semantic content, but often struggle with texture details. This limitation stems from the usage of class priors, which only provide coarse and limited conditional information. To address this issue, we propose Diffusion on Diffusion (DoD), an innovative multi-stage generation framework that first extracts visual priors from previously generated samples, then provides rich guidance for the diffusion model leveraging visual priors from the early stages of diffusion sampling. Specifically, we introduce a latent embedding module that employs a compression-reconstruction approach to discard redundant detail information from the conditional samples in each stage, retaining only the semantic information for guidance. We evaluate DoD on the popular ImageNet-$256 \times 256$ dataset, reducing $7\times$ …
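
The multi-stage idea in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: `latent_embed`, `denoise_step`, `sample_stage`, and `multi_stage_sample` are hypothetical names, the "denoiser" is a simple pull toward a target rather than a trained network, and average pooling stands in for the compression step of the latent embedding module. It only shows the control flow: each stage samples from noise, and every stage after the first is guided by a compressed embedding of the previous stage's output.

```python
import random

def latent_embed(sample, keep=4):
    """Toy stand-in for a latent embedding module (assumption): average-pool
    the sample down to `keep` values, which discards fine detail."""
    block = len(sample) // keep
    return [sum(sample[i * block:(i + 1) * block]) / block for i in range(keep)]

def denoise_step(x, prior):
    """Toy denoiser (assumption): move halfway toward a target derived from
    the prior. With no prior, the target is all zeros."""
    if prior is None:
        target = [0.0] * len(x)
    else:
        block = len(x) // len(prior)
        # Upsample the compressed prior back to the sample length.
        target = [prior[i // block] for i in range(len(x))]
    return [xi + 0.5 * (ti - xi) for xi, ti in zip(x, target)]

def sample_stage(dim, num_steps, prior, rng):
    """Run one full sampling pass starting from Gaussian noise."""
    x = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    for _ in range(num_steps):
        x = denoise_step(x, prior)
    return x

def multi_stage_sample(num_stages, dim=16, num_steps=10, seed=0):
    """Chain stages: the first has no visual prior; each later stage is
    conditioned on the embedding of the previous stage's sample."""
    rng = random.Random(seed)
    prior = None
    outputs = []
    for _ in range(num_stages):
        x = sample_stage(dim, num_steps, prior, rng)
        outputs.append(x)
        prior = latent_embed(x)
    return outputs
```

Calling `multi_stage_sample(3)` returns one sample per stage. The sampling loop runs once per stage, so in this sketch the sampling cost grows linearly with the number of stages, which is the trade-off the reviews on this page raise.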

Peer Reviews

Decision·ICLR 2025 Conference Withdrawn Submission

Reviewer 01·Rating 5·Confidence 5

Strengths

Experiments demonstrate that running the trained network in a multi-stage fashion yields visual quality improvements in the generated images. The image output by each stage appears to refine the details of the image produced by the previous stage, indicating that the latent embedding acts as a useful guide to the generation process. Quality improvements (as measured by FID) are achieved even when comparing to baselines at comparable inference cost (in FLOPs), as shown in Table 4.

Weaknesses

Figure 4 suggests that beyond a two-stage system (one unconditional and one conditional generation pass), there is minimal benefit to subsequent conditional generation stages; stage 2 and stage 3 FID scores seem nearly identical. Linear probing scores (Table 2) suggest that the LEM is learning only a limited semantic representation. For example, a pre-trained contrastive encoder would score much better on ImageNet linear probing. I am concerned about the novelty of the contribution. The over

Reviewer 02·Rating 5·Confidence 3

Strengths

Overall, the image quality in the paper seems fine with good details. Overall presentation of the paper is clear.

Weaknesses

The method includes multi-stage sampling, which makes the already slow sampling process even more time-consuming and computationally inefficient. However, the experimental results and design that demonstrate your method's computational efficiency are somewhat confusing and unconvincing to me. I have addressed some of these points in the next section.

Reviewer 03·Rating 5·Confidence 3

Strengths

(1) The proposed method enhances image generation quality compared to the baseline. (2) The authors conduct a comprehensive experimental analysis of model configurations for conditional image generation tasks, including visualizations for qualitative comparisons. (3) The multi-stage generative model utilizes a shared backbone, efficiently reducing the total parameter count as additional stages are added.

Weaknesses

(1) As shown in Figure 5, the images generated across multiple stages appear nearly identical, making it difficult to visually assess the advantages of the multi-stage system. (2) Although the multi-stage system incorporates an efficient shared-parameter design, its FLOPs increase with each additional stage. Performance seems to plateau after two stages, raising questions about the scalability of adding more stages.

Reviewer 04·Rating 3·Confidence 4

Strengths

1) From a theoretical level, I find the paper very nice. It is a) simple / elegant and b) fast. From a novelty perspective, I have no concerns. 2) The paper includes multiple strong results concerning the practical usefulness of the method (number of parameters, steps, GFLOPs). 3) This paper is well-written and well-presented.

Weaknesses

Points are in order of my perceived importance (most to least), indicating how heavily they weigh in my rating. 1) Table 5 (main results) seems to primarily compare against basic diffusion models, but misses comparisons against other post-hoc diffusion methods (e.g. MDTv2: Masked Diffusion Transformer is a Strong Image Synthesizer, Gao 2023 already cited in this paper; ReNO: Enhancing One-step Text-to-Image Models Through Reward-based Noise Optimization, Eyring 2024; ElasticDiffusion: Training-fre

Taxonomy

Topics·Computer Graphics and Visualization Techniques

Methods·Diffusion