Multistep Consistency Models
Jonathan Heek, Emiel Hoogeboom, Tim Salimans

TL;DR
This paper introduces Multistep Consistency Models that unify consistency and diffusion models, enabling a flexible trade-off between sampling speed and quality, and demonstrating strong empirical results on image generation tasks.
Contribution
It proposes a novel unification of consistency and diffusion models allowing interpolation between them, improving training and sample quality with multiple steps.
Findings
Achieves 1.4 FID on ImageNet 64 in 8 steps
Achieves 2.1 FID on ImageNet 128 in 8 steps
Scales to text-to-image diffusion models with high quality
Abstract
Diffusion models are relatively easy to train but require many steps to generate samples. Consistency models are far more difficult to train, but generate samples in a single step. In this paper we propose Multistep Consistency Models: A unification between Consistency Models (Song et al., 2023) and TRACT (Berthelot et al., 2023) that can interpolate between a consistency model and a diffusion model: a trade-off between sampling speed and sampling quality. Specifically, a 1-step consistency model is a conventional consistency model whereas an $\infty$-step consistency model is a diffusion model. Multistep Consistency Models work really well in practice. By increasing the sample budget from a single step to 2-8 steps, we can train models more easily that generate higher quality samples, while retaining much of the sampling speed benefits. Notable results are 1.4 FID on Imagenet 64 in…
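The interpolation described in the abstract can be illustrated with a minimal sampling sketch. It is not the authors' implementation. It assumes the time interval [0, 1] is split into equal segments, a variance-preserving cosine noise schedule, and a hypothetical `toy_denoiser` standing in for a trained network; the names `alpha_sigma`, `ddim_step`, and `multistep_sample` are illustrative as well.

```python
import numpy as np

def alpha_sigma(t):
    """Assumed variance-preserving cosine schedule (alpha^2 + sigma^2 = 1)."""
    return np.cos(0.5 * np.pi * t), np.sin(0.5 * np.pi * t)

def ddim_step(x_hat, z_t, t, s):
    """Standard deterministic DDIM update from time t to an earlier time s."""
    alpha_t, sigma_t = alpha_sigma(t)
    alpha_s, sigma_s = alpha_sigma(s)
    return alpha_s * x_hat + (sigma_s / sigma_t) * (z_t - alpha_t * x_hat)

def multistep_sample(denoiser, shape, steps, rng):
    """Alternate a clean-data prediction with a DDIM step to the next boundary."""
    z = rng.standard_normal(shape)           # start from pure noise at t = 1
    for i in range(steps, 0, -1):
        t, s = i / steps, (i - 1) / steps    # current and next segment boundary
        x_hat = denoiser(z, t)               # one network evaluation per step
        z = ddim_step(x_hat, z, t, s)
    return z                                 # at s = 0, sigma is 0, so z equals x_hat

def toy_denoiser(z, t):
    """Hypothetical stand-in: the ideal denoiser when all data equals 2.0."""
    return np.full_like(z, 2.0)

samples = multistep_sample(toy_denoiser, shape=(4,), steps=4,
                           rng=np.random.default_rng(0))
```

With `steps=1` the loop makes a single network call, which is the consistency-model end of the trade-off; as `steps` grows, the loop approaches ordinary DDIM sampling of a diffusion model.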
Peer Reviews
Decision: ICLR 2025 Conference Withdrawn Submission
The aDDIM sampler addresses the oversmoothing problem in deterministic sampling mechanisms, achieving performance comparable to that of a second-order sampler. The paper also provides comprehensive experiments on class-conditioned ImageNet and text-to-image generation.
See questions below.
This work extends the Consistency Model by leveraging the inversion of DDIM to generate a clean regression target based on the DDIM solution at a middle timestep, starting from a noisier timestep. The proposed method demonstrates strong empirical performance in diffusion distillation without the need for adversarial training.
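The "inversion of DDIM" this review mentions can be sketched by solving the standard DDIM update for the clean-data term. The snippet below is an illustration under assumptions, not the paper's code: the cosine schedule and the function names `ddim` and `inv_ddim` are chosen here for the example, and the inverse is obtained by plain algebra from the DDIM formula.

```python
import numpy as np

def alpha_sigma(t):
    """Assumed variance-preserving cosine schedule (alpha^2 + sigma^2 = 1)."""
    return np.cos(0.5 * np.pi * t), np.sin(0.5 * np.pi * t)

def ddim(x, z_t, t, s):
    """DDIM update: z_s = alpha_s * x + (sigma_s / sigma_t) * (z_t - alpha_t * x)."""
    alpha_t, sigma_t = alpha_sigma(t)
    alpha_s, sigma_s = alpha_sigma(s)
    return alpha_s * x + (sigma_s / sigma_t) * (z_t - alpha_t * x)

def inv_ddim(z_s, z_t, t, s):
    """Solve the DDIM update for x: the clean prediction that carries z_t to z_s."""
    alpha_t, sigma_t = alpha_sigma(t)
    alpha_s, sigma_s = alpha_sigma(s)
    ratio = sigma_s / sigma_t
    return (z_s - ratio * z_t) / (alpha_s - ratio * alpha_t)

# Round trip: whatever x was used to step from t to s is recovered by the inverse.
rng = np.random.default_rng(0)
t, s = 0.8, 0.5                               # noisier time t, middle time s
x_target = rng.standard_normal(5)             # a clean prediction to recover
z_t = rng.standard_normal(5)                  # an arbitrary noisy latent at time t
z_s = ddim(x_target, z_t, t, s)
x_recovered = inv_ddim(z_s, z_t, t, s)
```

Because the DDIM update is linear in the clean-data term, a point at the middle timestep together with the noisier starting point determines a unique clean target, which is what makes it usable as a regression target.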
1. There are some typos, such as in line 101 where "*One than...*" should be "*One then...*". Additionally, some characters should be in bold but are not. Some parts are not consistent, such as using both *Table XX* and *Tbl. XX*.
2. Regarding the flow in Section 2, would it improve clarity to switch the paragraphs *Consistency Training and Distillation* and *DDIM Sampler*? Currently, the notation $\mathrm{DDIM}_{t\rightarrow s}$ appears in Eq. (3) without prior introduction. Additionally, cert
This paper proposes a method that can be directly applied to the large-scale models like SDXL.
Major Comments
- In line 135, what does the term "difficult" refer to? Is it about the instability during training or the model’s overall performance?
- In the paragraph starting at line 145, my understanding is that Consistency Models did not originally introduce a gap between model evaluations at t and s, although extending in that direction seems intuitive. Additionally, even as $s \to t$, would longer propagation during training significantly affect the author's setup? To my knowledge, 200k iter
1. MCM achieves a trade-off between DM and CM. Unlike CTM, MCM does not rely on GAN loss to achieve high-quality samples, which yields more stable training and more flexible application.
2. The author discussed that the DDIM step tends to underestimate the variance. The proposed aDDIM can address this mismatch and improve the results.
3. The author provides both lower-resolution and higher-resolution results using MCM, with and without distillation. These comprehensive experiments demonstrate th
1. **(major) the trade-off seems to be limited.** This paper achieves a trade-off between CMs and DMs. However, this trade-off is achieved by setting the number of student steps in advance. While this approach demonstrates a clear improvement over standard CMs and DMs in terms of generation quality and speed, respectively, it significantly limits flexibility due to the fixed design choice. The authors also discuss CTMs, arguing that adversarial training is required to ensure high-quality samples. Ho
Taxonomy
Topics: Complex Systems and Decision Making · Transportation and Mobility Innovations
Methods: Consistency Models · Diffusion · SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings
