Multistep Consistency Models

Jonathan Heek; Emiel Hoogeboom; Tim Salimans

arXiv:2403.06807·cs.LG·November 20, 2024·3 cites


Open Access · 4 Reviews

TL;DR

This paper introduces Multistep Consistency Models that unify consistency and diffusion models, enabling a flexible trade-off between sampling speed and quality, and demonstrating strong empirical results on image generation tasks.

Contribution

It unifies Consistency Models and TRACT into a single family that interpolates between a consistency model and a diffusion model. Using a small multistep budget (2-8 steps) instead of a single step makes training easier and improves sample quality.

Findings

01. Achieves 1.4 FID on ImageNet 64 in 8 steps

02. Achieves 2.1 FID on ImageNet 128 in 8 steps

03. Scales to text-to-image diffusion models with high quality

Abstract

Diffusion models are relatively easy to train but require many steps to generate samples. Consistency models are far more difficult to train, but generate samples in a single step. In this paper we propose Multistep Consistency Models: a unification between Consistency Models (Song et al., 2023) and TRACT (Berthelot et al., 2023) that can interpolate between a consistency model and a diffusion model: a trade-off between sampling speed and sampling quality. Specifically, a 1-step consistency model is a conventional consistency model whereas an $\infty$-step consistency model is a diffusion model. Multistep Consistency Models work really well in practice. By increasing the sample budget from a single step to 2-8 steps, we can train models more easily that generate higher quality samples, while retaining much of the sampling speed benefits. Notable results are 1.4 FID on Imagenet 64 in…
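The interpolation described in the abstract can be illustrated with a toy sampling loop. The following is a minimal sketch, not the authors' implementation: it assumes a scalar latent, a cosine variance-preserving schedule, and a deterministic DDIM update between segment boundaries. The names `alpha_sigma`, `ddim_step`, `multistep_sample`, and `toy_denoiser` are hypothetical. With `steps=1` the loop collapses to a single model call, as in a conventional consistency model; larger `steps` values move toward diffusion-style iterative sampling.

```python
import math
import random


def alpha_sigma(t):
    # Assumed cosine variance-preserving schedule: alpha^2 + sigma^2 = 1.
    return math.cos(0.5 * math.pi * t), math.sin(0.5 * math.pi * t)


def ddim_step(x_hat, z_t, t, s):
    # Deterministic DDIM update from time t to an earlier time s < t,
    # given the current clean-data estimate x_hat.
    alpha_t, sigma_t = alpha_sigma(t)
    alpha_s, sigma_s = alpha_sigma(s)
    eps_hat = (z_t - alpha_t * x_hat) / sigma_t  # implied noise estimate
    return alpha_s * x_hat + sigma_s * eps_hat


def multistep_sample(denoise, steps, rng):
    # Split [0, 1] into `steps` equal segments; call the model once per segment.
    z = rng.gauss(0.0, 1.0)  # start from pure noise at t = 1
    for i in range(steps, 0, -1):
        t, s = i / steps, (i - 1) / steps
        x_hat = denoise(z, t)          # model's clean-data prediction
        z = ddim_step(x_hat, z, t, s)  # move to the next segment boundary
    return z


def toy_denoiser(z, t):
    # Hypothetical stand-in for a trained network: the data distribution is
    # a point mass at 2.0, so the ideal prediction is always 2.0.
    return 2.0


if __name__ == "__main__":
    for steps in (1, 2, 8):
        print(steps, multistep_sample(toy_denoiser, steps, random.Random(0)))
```

Because the toy data distribution is a single point, every step budget returns the same sample here; the sketch only shows where the step count enters the loop, not the quality trade-off reported in the paper.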

Peer Reviews

Decision · ICLR 2025 Conference Withdrawn Submission

Reviewer 01 · Rating 5 · Confidence 5

Strengths

The aDDIM sampler addresses the oversmoothing problem in deterministic sampling mechanisms, achieving performance comparable to that of a second-order sampler. The paper also provides comprehensive experiments on class-conditioned ImageNet and text-to-image generation.

Weaknesses

See questions below.

Reviewer 02 · Rating 5 · Confidence 4

Strengths

This work extends the Consistency Model by leveraging the inversion of DDIM to generate a clean regression target based on the DDIM solution at a middle timestep, starting from a noisier timestep. The proposed method demonstrates strong empirical performance in diffusion distillation without the need for adversarial training.
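The DDIM inversion the reviewer refers to can be sketched as follows. This is an illustrative sketch under assumptions, not the paper's code: it uses a scalar latent and a cosine variance-preserving schedule, and the names `alpha_sigma`, `ddim`, and `inv_ddim` are hypothetical. Since a DDIM step from $t$ to $s$ is linear in the clean estimate $x$, it can be solved for the $x$ that maps a noisier latent $z_t$ onto a given latent $z_s$ at the intermediate time; that value serves as the clean regression target.

```python
import math


def alpha_sigma(t):
    # Assumed cosine variance-preserving schedule: alpha^2 + sigma^2 = 1.
    return math.cos(0.5 * math.pi * t), math.sin(0.5 * math.pi * t)


def ddim(x, z_t, t, s):
    # DDIM step from t to s < t:
    # z_s = alpha_s * x + sigma_s * (z_t - alpha_t * x) / sigma_t
    alpha_t, sigma_t = alpha_sigma(t)
    alpha_s, sigma_s = alpha_sigma(s)
    return alpha_s * x + sigma_s * (z_t - alpha_t * x) / sigma_t


def inv_ddim(z_s, z_t, t, s):
    # Solve the DDIM step for x: the clean estimate that carries z_t to z_s.
    alpha_t, sigma_t = alpha_sigma(t)
    alpha_s, sigma_s = alpha_sigma(s)
    ratio = sigma_s / sigma_t
    return (z_s - ratio * z_t) / (alpha_s - alpha_t * ratio)


if __name__ == "__main__":
    x, z_t, t, s = 0.7, -1.3, 0.9, 0.5
    z_s = ddim(x, z_t, t, s)
    print(inv_ddim(z_s, z_t, t, s))  # recovers x up to floating-point error
```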

Weaknesses

1. There are some typos, such as in line 101 where "*One than...*" should be "*One then...*". Additionally, some characters should be in bold but are not. Some parts are not consistent, such as using both *Table XX* and *Tbl. XX*.
2. Regarding the flow in Section 2, would it improve clarity to switch the paragraphs *Consistency Training and Distillation* and *DDIM Sampler*? Currently, the notation $\mathrm{DDIM}_{t\rightarrow s}$ appears in Eq. (3) without prior introduction. Additionally, cert

Reviewer 03 · Rating 3 · Confidence 4

Strengths

This paper proposes a method that can be directly applied to the large-scale models like SDXL.

Weaknesses

Major Comments
- In line 135, what does the term "difficult" refer to? Is it about the instability during training or the model’s overall performance?
- In the paragraph starting at line 145, my understanding is that Consistency Models did not originally introduce a gap between model evaluations at t and s, although extending in that direction seems intuitive. Additionally, even as s->t, would longer propagation during training significantly affect the author's setup? To my knowledge, 200k iter

Reviewer 04 · Rating 3 · Confidence 4

Strengths

1. MCM achieves a trade-off between DM and CM. Unlike CTM, MCM does not rely on GAN loss to achieve high-quality samples, which yields more stable training and more flexible application.
2. The author discussed that the DDIM step tends to underestimate the variance. The proposed aDDIM can address this mismatch and improve the results.
3. The author provides both lower-resolution and higher-resolution results using MCM, with and without distillation. These comprehensive experiments demonstrate th

Weaknesses

1. **(major) the trade-off seems to be limited.** This paper achieves a trade-off between CMs and DMs. However, this trade-off is achieved by setting the number of student steps in advance. While this approach demonstrates a clear improvement over standard CMs and DMs in terms of generation quality and speed, respectively, it significantly limits flexibility due to the fixed design choice. The authors also discuss CTMs, arguing that adversarial training is required to ensure high-quality samples. Ho


Taxonomy

Topics: Complex Systems and Decision Making · Transportation and Mobility Innovations

Methods: Consistency Models · Diffusion · SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings