TL;DR
This paper introduces a latent perceptual loss for latent diffusion models that enhances image sharpness and realism, significantly improving FID scores across multiple datasets and resolutions.
Contribution
It proposes a novel latent perceptual loss that integrates decoder features to improve image quality in latent diffusion models, addressing the disconnect between diffusion training and image decoding.
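As a rough illustration of the idea, the sketch below compares the intermediate activations a decoder produces for a predicted latent and for a target latent, then adds that feature distance to a plain latent MSE. It is a minimal toy under stated assumptions, not the paper's implementation: the two-layer random "decoder", the unweighted per-layer mean squared difference, and the names `make_toy_decoder`, `decoder_features`, `latent_perceptual_loss`, and `lpl_weight` are all hypothetical. How the paper normalizes or weights decoder features is not stated in this summary.

```python
import math
import random


def make_toy_decoder(dims, seed=0):
    """Build random weight matrices for a hypothetical stack of decoder layers."""
    rng = random.Random(seed)
    return [
        [[rng.uniform(-1.0, 1.0) for _ in range(d_in)] for _ in range(d_out)]
        for d_in, d_out in zip(dims[:-1], dims[1:])
    ]


def decoder_features(decoder, latent):
    """Run a latent through the toy decoder and collect every layer's activations."""
    features, h = [], list(latent)
    for weights in decoder:
        h = [math.tanh(sum(w * x for w, x in zip(row, h))) for row in weights]
        features.append(h)
    return features


def mse(a, b):
    """Mean squared error between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def latent_perceptual_loss(decoder, z_pred, z_target, layer_weights=None):
    """Weighted sum of per-layer MSEs between decoder features of two latents."""
    f_pred = decoder_features(decoder, z_pred)
    f_target = decoder_features(decoder, z_target)
    if layer_weights is None:
        layer_weights = [1.0] * len(decoder)  # assumption: equal layer weights
    return sum(w * mse(a, b) for w, a, b in zip(layer_weights, f_pred, f_target))


# Toy usage: a 4-dimensional latent decoded through layers of width 8 and 16.
decoder = make_toy_decoder([4, 8, 16])
z_target = [0.5, -0.2, 0.1, 0.9]   # stand-in for the clean latent
z_pred = [0.4, -0.1, 0.3, 0.7]     # stand-in for the model's estimate of it

lpl_weight = 0.5  # hypothetical trade-off between the two terms
base_loss = mse(z_pred, z_target)
lpl = latent_perceptual_loss(decoder, z_pred, z_target)
total_loss = base_loss + lpl_weight * lpl
```

In a real latent diffusion setup the decoder would be the pretrained autoencoder's decoder, and the predicted latent would come from the diffusion model's output at a given timestep; both are outside the scope of this toy.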
Findings
FID scores improved by 6% to 20% with the perceptual loss.
Enhanced image sharpness and realism demonstrated in qualitative results.
Applicable across various diffusion paradigms and autoencoders.
Abstract
Latent diffusion models (LDMs) power state-of-the-art high-resolution generative image models. LDMs learn the data distribution in the latent space of an autoencoder (AE) and produce images by mapping the generated latents into RGB image space using the AE decoder. While this approach allows for efficient model training and sampling, it induces a disconnect between the training of the diffusion model and the decoder, resulting in a loss of detail in the generated images. To remediate this disconnect, we propose to leverage the internal features of the decoder to define a latent perceptual loss (LPL). This loss encourages the models to create sharper and more realistic images. Our loss can be seamlessly integrated with common autoencoders used in latent diffusion models, and can be applied to different generative modeling paradigms such as DDPM with epsilon and velocity prediction, as…
Peer Reviews
Decision: ICLR 2025 Poster
1. The motivation is clear and straightforward, and the proposed method is simple and can be easily applied to the training of other diffusion models.
2. With the same number of training iterations during the post-training stage, the method improves FID over the baseline, which uses only the MSE loss.
1. The paper only shows the performance increase over the baseline model. It would be better to clearly demonstrate the effectiveness and performance gain over previous state-of-the-art methods, to show that the perceptual loss can achieve what the widely used MSE loss cannot.
2. Introducing the perceptual loss would increase the computational cost during training. Could the authors provide a clear comparison on this?
3. Following the last one, what would be the
The comparisons with baselines are extensive and clearly show that the proposed latent perceptual loss objective improves metrics across different diffusion formulations and datasets. I really appreciate the authors demonstrating the improvements for both eps-pred diffusion and flow matching, demonstrating that the proposed loss potentially has a general significance. The paper ablates over/explores a large number of parameters and their influence. The paper is generally reasonably well-writte
I think framing the proposed loss as a perceptual loss is likely incorrect. Perceptual losses typically try to incorporate human perception-based invariances into the loss, such as weighting the presence of the correct texture (e.g., grass) as more important than getting every detail of the instance of the texture (e.g., the exact positions of individual blades of grass) right. This is directly opposed to losses such as the MSE in pixel space. This is typically accomplished by taking features of
- The authors propose a latent perceptual loss (LPL) that shows efficacy across various tasks and datasets.
- The qualitative results and the quantitative metrics seem promising.
- The frequency analysis shows that the method works.
- The ablations are carried out in a systematic way.
- The paper seems unpolished and rushed towards the deadline.
- Sometimes the notation is a bit confusing; see the third point in the questions section.
- Applying a perceptual loss in T2I generation isn't novel; Lin and Yang [1] calculate the perceptual loss in the middle blocks to reduce computation, which bypasses the computational constraint, whereas this paper requires passing results to the decoder for intermediate features. If the authors can provide evidence that the method proposed is better an
Taxonomy
Methods: Diffusion · Autoencoders
