Improved Training Technique for Latent Consistency Models
Quan Dao, Khanh Doan, Di Liu, Trung Le, Dimitris Metaxas

TL;DR
This paper presents a training technique for latent consistency models that combines outlier mitigation, a diffusion loss, optimal transport coupling, and adaptive scaling to enable high-quality one- or two-step sampling in the latent space.
Contribution
It introduces multiple strategies including Cauchy loss, diffusion loss, OT coupling, and adaptive scaling to enhance latent consistency model training and performance.
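As an illustration of the OT coupling named above, the following minimal sketch pairs each data sample in a minibatch with a noise sample so that the total squared distance is minimal. The brute-force search over permutations, the function names, and the toy values are all assumptions made for a tiny batch; the paper's actual procedure may differ, and a practical implementation would use a proper assignment solver.

```python
from itertools import permutations


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def ot_coupling(data, noise):
    """Pair data[i] with noise[perm[i]] so the total squared distance is minimal.

    Brute force over all permutations (O(n!)); only meant for tiny batches.
    Returns the permutation and its total cost.
    """
    n = len(data)
    best_perm, best_cost = None, float("inf")
    for perm in permutations(range(n)):
        cost = sum(sq_dist(data[i], noise[perm[i]]) for i in range(n))
        if cost < best_cost:
            best_perm, best_cost = perm, cost
    return list(best_perm), best_cost


# Toy minibatch (hypothetical values): three 2-D "latents" and three noise vectors.
data = [[0.0, 0.0], [5.0, 5.0], [-3.0, 2.0]]
noise = [[4.0, 6.0], [-2.0, 1.0], [0.5, -0.5]]

perm, cost = ot_coupling(data, noise)
identity_cost = sum(sq_dist(d, z) for d, z in zip(data, noise))
print("coupling:", perm, "cost:", cost, "random-order cost:", identity_cost)
```

With equal batch sizes and uniform weights, the optimal transport plan between the two empirical distributions is a permutation, which is why an assignment search is enough for this sketch.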
Findings
Achieved high-quality one- or two-step sampling in latent space.
Significantly narrowed the performance gap with diffusion models.
Demonstrated robustness against outliers in latent data.
Abstract
Consistency models are a new family of generative models capable of producing high-quality samples in either a single step or multiple steps. Recently, consistency models have demonstrated impressive performance, achieving results on par with diffusion models in the pixel space. However, the success of scaling consistency training to large-scale datasets, particularly for text-to-image and video generation tasks, is determined by performance in the latent space. In this work, we analyze the statistical differences between pixel and latent spaces, discovering that latent data often contains highly impulsive outliers, which significantly degrade the performance of iCT in the latent space. To address this, we replace Pseudo-Huber losses with Cauchy losses, effectively mitigating the impact of outliers. Additionally, we introduce a diffusion loss at early timesteps and employ optimal…
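To make the loss swap concrete, here is a minimal sketch comparing the two losses on a residual vector. It assumes the common forms sqrt(||r||^2 + c^2) - c for Pseudo-Huber and log(1 + ||r||^2 / (2c^2)) for Cauchy; the function names, the constant c, and the toy residuals are illustrative assumptions, not taken from the authors' code.

```python
import math


def pseudo_huber(residual, c=1.0):
    """Pseudo-Huber loss: sqrt(||r||^2 + c^2) - c (linear growth for large ||r||)."""
    sq_norm = sum(r * r for r in residual)
    return math.sqrt(sq_norm + c * c) - c


def cauchy(residual, c=1.0):
    """Cauchy loss: log(1 + ||r||^2 / (2 c^2)) (logarithmic growth for large ||r||)."""
    sq_norm = sum(r * r for r in residual)
    return math.log1p(sq_norm / (2.0 * c * c))


# Toy residuals (assumed values): a typical one and one with an impulsive outlier.
typical = [0.1, -0.2, 0.05]
outlier = [0.1, -0.2, 50.0]

# How much more the outlier contributes than a typical residual under each loss.
ratio_ph = pseudo_huber(outlier) / pseudo_huber(typical)
ratio_cauchy = cauchy(outlier) / cauchy(typical)
print(f"Pseudo-Huber outlier/typical ratio: {ratio_ph:.1f}")
print(f"Cauchy outlier/typical ratio:       {ratio_cauchy:.1f}")
```

Because the Cauchy loss grows only logarithmically in the residual norm, a single impulsive coordinate dominates the batch loss far less than it does under Pseudo-Huber, which is the intuition behind the robustness claim.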
Peer Reviews
Decision: ICLR 2025 Poster
1. The analysis and motivation of this manuscript are sound, and to the best of my knowledge, the authors are the first to reveal the effect of latent-space outliers on consistency model training. 2. Each of the proposed techniques is well ablated. 3. From the visualizations provided by the authors (Figures 4, 5), several of the proposed techniques significantly improve the visual quality on data such as CelebA-HQ.
1. As an empirical paper, the authors seem to have compared only with iCT reproduced in latent space. In fact, there have been many improvements to consistency models in latent space, such as [1, 2], which the authors should compare with or discuss. [1] Hyper-SD: Trajectory Segmented Consistency Model for Efficient Image Synthesis [2] Trajectory Consistency Distillation 2. The authors' experiments are limited to some simple modal datasets such as the FFHQ and CelebA-HQ datasets. Empir
1. The motivation of this article is very sound, and the introductory section does a good job of presenting the motivation and contribution of the article. Given the popularity of latent diffusion modeling, this research may have important implications for the practical application of accelerated diffusion. 2. The authors found significant differences between latent space training and pixel space training. The latter is usually normalized and the former may have impulse noise. This f
1. The link to TD training in DQN in Section 4.1 seems somewhat redundant. There is no evidence that DQN has a similar problem with impulse noise, and none of the solutions proposed by the authors is derived from it. 2. The authors mention in the introduction that the aim is to address the potential proliferation of large-scale applications such as text-to-image or video generation. However, instead of using a text-to-image model like LCM, the authors ended up experimenting on some simp
1. The writing of the paper is clear. 2. The paper analyzes the reasons for the poor performance of improved consistency training in latent space. 3. The proposed method achieves a substantial improvement over improved consistency training (iCT) in latent space.
1. Equation (9) should be ||f(x_t) - x_0||^2. 2. Please briefly explain how the constant in Equation (11) is determined. 3. Batch size has a significant impact on generative models, especially in consistency training; please include the batch size in Table 1. 4. Please include a comparison of the results from latent consistency distillation, including the resources used for training.
Taxonomy
Topics: Topic Modeling
Methods: ADaptive gradient method with the OPTimal convergence rate · Diffusion · Consistency Models
