Accelerate High-Quality Diffusion Models with Inner Loop Feedback
Matthew Gwilliam, Han Cai, Di Wu, Abhinav Shrivastava, Zhiyu Cheng

TL;DR
This paper introduces Inner Loop Feedback (ILF), a method that accelerates diffusion model inference by training a lightweight module to predict future features, achieving faster generation without sacrificing quality.
Contribution
ILF is a novel, flexible approach that trains a feedback module to predict future features, significantly speeding up diffusion models while maintaining high-quality outputs.
Findings
ILF achieves 1.7x-1.8x speedups in diffusion model inference.
ILF maintains high image quality, scoring well on metrics such as FID and CLIP score.
The method is effective across different diffusion architectures and tasks.
Abstract
We propose Inner Loop Feedback (ILF), a novel approach to accelerate diffusion models' inference. ILF trains a lightweight module to predict future features in the denoising process by leveraging the outputs from a chosen diffusion backbone block at a given time step. This approach exploits two key intuitions: (1) the outputs of a given block at adjacent time steps are similar, and (2) performing partial computations for a step imposes a lower burden on the model than skipping the step entirely. Our method is highly flexible, since we find that the feedback module itself can simply be a block from the diffusion backbone, with all settings copied. Its influence on the diffusion forward can be tempered with a learnable scaling factor from zero initialization. We train this module using distillation losses; however, unlike some prior work where a full diffusion backbone serves as the…
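The zero-initialized scaling described in the abstract can be illustrated with a minimal, dependency-free sketch. It is not the authors' implementation: `ToyBlock`, `FeedbackModule`, and the residual way the feedback is added back to the features are all assumptions made for illustration. The sketch only shows why a scale that starts at zero leaves the backbone's behavior unchanged at the start of training.

```python
import copy


class ToyBlock:
    """Hypothetical stand-in for one backbone block: an elementwise affine map."""

    def __init__(self, weight, bias):
        self.weight = weight
        self.bias = bias

    def __call__(self, features):
        return [self.weight * x + self.bias for x in features]


class FeedbackModule:
    """Hypothetical feedback module: a copy of a backbone block behind a scalar gate."""

    def __init__(self, backbone_block):
        # Copy the block so the feedback module starts with the same settings
        # but can later be trained independently of the backbone.
        self.block = copy.deepcopy(backbone_block)
        # Would be a learnable parameter in a real setup; zero at initialization.
        self.scale = 0.0

    def __call__(self, features):
        predicted = self.block(features)
        # Assumed residual injection: features plus scaled feedback.
        return [x + self.scale * p for x, p in zip(features, predicted)]


block = ToyBlock(weight=2.0, bias=1.0)
feedback = FeedbackModule(block)
features = [1.0, -2.0, 0.5]

at_init = feedback(features)      # scale is 0.0, so the features pass through
feedback.scale = 0.5              # pretend training moved the scale away from zero
after_update = feedback(features)
```

With the scale at zero the module is an identity on the features, so adding it cannot degrade the pretrained model before any training has happened; the feedback only takes effect as the scale moves away from zero.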
Peer Reviews
Decision: ICLR 2025 Conference Withdrawn Submission
- The paper is well written. - The method is examined both visually and numerically. - The method improves the quality of the generation while reducing the number of diffusion steps. - The method is examined on 5 different diffusion models. - The approach uses a relatively lightweight network for the feedback.
- The method is examined only on transformer-based networks, which limits the evaluation of the approach. - The method is compared to caching that uses a UNet architecture, making the comparison less effective. - Line 264 contains a typo; please rephrase. - The approach fine-tunes the network to predict a score different from the one used in training; what is the advantage of that? Also, how does it compare to fine-tuning the network itself without the feedback using this setup? - The method is comp
- The paper presents a method for distilling denoising steps in a diffusion model using inner feature maps, which could encapsulate more information than standard distillation that does not observe inner feature maps. - The qualitative results show no significant performance drop compared to the original model, while being much faster. - Results are demonstrated on SoTA text-to-image models.
- **Comparison to Step Distillation Approaches:** ILF introduces a self-distillation approach to accelerate the inference of diffusion models. Since this concept has been explored in previous works [1,2,3], a comprehensive comparison with these methods is essential. However, ILF is primarily compared to a caching-based approach that does not involve step-skipping training. The only comparison to another distillation method is a brief qualitative ablation (Fig. 11) with just three samples. This r
- The paper presents a novel, efficient method for diffusion model acceleration, focusing on maintaining high-quality generation while reducing inference time. - The approach is flexible, adaptable across multiple architectures (such as PixArt and DiT) and tasks. - The ILF technique effectively balances speed and image quality, outperforming caching methods and achieving strong performance in human-aligned quality metrics. - The Fast Approximate Distillation and Feedback-aware Inference Scheduli
Table 1 shows that the performance improvement of ILF over the baseline method with steps=2 is not always significant. In the Fast Approximate Distillation section, presenting the final noise expression with a formula would improve readability.
1. The proposed method is novel and solid. It's interesting that the authors combine the view of inner blocks and timesteps; I think the scheduler would be a tricky part during training, but it worked. 2. Experiment results look good and the acceleration is considerable. 3. Experiment design and discussion cover most of the topics.
1. My main concern is about the writing; there is a lot to be improved, and I list just some examples here: - Avoid using verbal expressions and make the sentences concise. For example, “we already know that” (L96), “one does not need to store an entire additional set of models weights” (L92), “This is clearly not optimal” (L107), “feed its output features to the feedback” (module?) (L68), “for this caching” (L215) etc. - Make the long sentences logically fluent. E.g., “Different diffusion models have
Taxonomy
Topics: Model Reduction and Neural Networks · Nuclear reactor physics and engineering · Numerical methods for differential equations
Methods: Diffusion · Contrastive Language-Image Pre-training · Focus
