Truncated Consistency Models

Sangyun Lee; Yilun Xu; Tomas Geffner; Giulia Fanti; Karsten Kreis; Arash Vahdat; Weili Nie

arXiv:2410.14895·cs.LG·January 24, 2025

Open Access · 3 Reviews

TL;DR

This paper introduces truncated consistency training for consistency models: training is restricted to a truncated time range so that the model spends its capacity on generation rather than on denoising at earlier time steps, which improves one-step generation quality while preventing trivial solutions.

Contribution

The paper proposes a truncated training method, a new parameterization of the consistency function, and a two-stage training process that together improve consistency models' performance and reduce model size.

Findings

01 · Achieves better FID scores on the CIFAR-10 and ImageNet 64×64 datasets.

02 · Uses networks over 2x smaller than the previous state of the art.

03 · Improves one-step and two-step generation quality.

Abstract

Consistency models have recently been introduced to accelerate sampling from diffusion models by directly predicting the solution (i.e., data) of the probability flow ODE (PF ODE) from initial noise. However, the training of consistency models requires learning to map all intermediate points along PF ODE trajectories to their corresponding endpoints. This task is much more challenging than the ultimate objective of one-step generation, which only concerns the PF ODE's noise-to-data mapping. We empirically find that this training paradigm limits the one-step generation performance of consistency models. To address this issue, we generalize consistency training to the truncated time range, which allows the model to ignore denoising tasks at earlier time steps and focus its capacity on generation. We propose a new parameterization of the consistency function and a two-stage training…
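
The idea of restricting training to a truncated time range can be illustrated with a small time sampler. The sketch below is not the paper's method: the function name `sample_truncated_times`, the log-uniform draw over the interior, and the fixed fraction of samples pinned to the lower boundary are all assumptions chosen for illustration. The reviews mention a proposal distribution that emphasizes the boundary, but its actual form is not given on this page.

```python
import numpy as np


def sample_truncated_times(n, t_prime, t_max, boundary_frac=0.25, rng=None):
    """Draw n training times from the truncated range [t_prime, t_max].

    Hypothetical helper: a fraction `boundary_frac` of the draws is pinned
    to the boundary t_prime, and the rest are log-uniform on
    [t_prime, t_max]. Both choices are illustrative assumptions.
    """
    if not 0.0 < t_prime < t_max:
        raise ValueError("need 0 < t_prime < t_max")
    if not 0.0 <= boundary_frac <= 1.0:
        raise ValueError("boundary_frac must lie in [0, 1]")
    rng = np.random.default_rng() if rng is None else rng

    # Number of samples placed exactly on the boundary time t_prime.
    n_boundary = int(round(boundary_frac * n))

    # Remaining samples: log-uniform over the truncated range only, so no
    # training time is ever drawn from the ignored range below t_prime.
    log_t = rng.uniform(np.log(t_prime), np.log(t_max), size=n - n_boundary)
    interior = np.clip(np.exp(log_t), t_prime, t_max)  # guard float rounding

    return np.concatenate([np.full(n_boundary, t_prime), interior])


if __name__ == "__main__":
    times = sample_truncated_times(8, t_prime=1.0, t_max=80.0,
                                   rng=np.random.default_rng(0))
    print(times)
```

In a training loop, times drawn this way would replace draws over the full time range; the values `t_prime=1.0` and `t_max=80.0` above are placeholders rather than settings taken from the paper.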

Peer Reviews

Decision·ICLR 2025 Poster

Reviewer 01 · Rating 8 · Confidence 5

Strengths

- The paper is well written and easy to follow. The math derivations are technically correct.
- The idea of truncation is novel, and the proposed method can further address the training instability of consistency models, which is important to the community.
- The empirical results are strong, showing the effectiveness of the method.

Weaknesses

It is unclear whether the effectiveness comes from the truncation or simply from changing the proposal distribution to focus more on the boundary. Specifically, the major techniques in this paper include two parts:

1. two-stage truncated training to avoid overtraining on denoising tasks;
2. changing the proposal distribution to focus more on the boundary conditions.

However, with only the first method, the training still diverges (L241-L242), which shows that the part 2 seems to be m

Reviewer 02 · Rating 6 · Confidence 3

Strengths

- The motivation, namely the difficulty of learning the consistency function across the whole time range, is sound. The second-stage model only needs to map noised data to clean data in a limited time interval, which is an easier task, so it is no surprise that it fits the ODE trajectory better.
- The phenomenon that consistency training gradually weakens the model's denoising capabilities at small $t$ is well illustrated in Figure 2.
- The method achieves better FID on standard ima

Weaknesses

- The overall idea and motivation are not actually new. ECT already points out the trade-off between denoising capacity and consistency capacity, and suggests initializing a consistency model with a diffusion model. The authors' contribution, in my opinion, is to use a two-stage method to replace the dedicated iteration-dependent training schedule in ECT.
- The 2-step performance improvement is relatively marginal compared to ECT. As suggested by ECT, it is better to use a 2-step generati

Reviewer 03 · Rating 6 · Confidence 3

Strengths

The paper is well structured and clear. I appreciate the extensive experiments illustrating the trade-off between denoising and generation in consistency training. Additionally, the empirical results seem competitive with the baselines.

Weaknesses

1. The approach appears a bit over-engineered, with numerous handcrafted designs and hyperparameters, such as the weighting function $\psi_t(t)$, $\lambda_b$, $N_B$, $\Delta_t$, $\Delta_{t'}$, and the interval division $t'$.
2. How should one determine the termination point of Stage 1 training? Monitoring its progress may introduce additional complexity. The number of training iterations for Stage 1 may also be a crucial hyperparameter that requires further ablation studies.
3. The pap

Taxonomy

Topics: Economic Policies and Impacts · Complex Systems and Decision Making · Transportation and Mobility Innovations

Methods: Consistency Models · Diffusion · Focus