Pixel-Space Post-Training of Latent Diffusion Models

Christina Zhang; Simran Motwani; Matthew Yu; Ji Hou; Felix Juefei-Xu; Sam Tsai; Peter Vajda; Zijian He; Jialiang Wang

arXiv:2409.17565 · cs.CV · September 27, 2024

Open Access

TL;DR

This paper introduces a pixel-space supervision method for post-training latent diffusion models, significantly enhancing high-frequency detail preservation and visual quality without sacrificing text alignment accuracy.

Contribution

The paper proposes a pixel-space post-training approach for LDMs that addresses imperfect high-frequency detail and improves visual quality in image generation.

Findings

01

Pixel-space supervision improves high-frequency detail preservation.

02

Pixel-space supervision improves visual quality and visual flaw metrics in LDMs.

03

Text alignment quality is maintained after post-training.

Abstract

Latent diffusion models (LDMs) have made significant advancements in the field of image generation in recent years. One major advantage of LDMs is their ability to operate in a compressed latent space, allowing for more efficient training and deployment. Despite these advantages, challenges remain. For example, it has been observed that LDMs often generate high-frequency details and complex compositions imperfectly. We hypothesize that one reason for these flaws is that all pre- and post-training of LDMs is done in latent space, which typically has $8 \times 8$ lower spatial resolution than the output images. To address this issue, we propose adding pixel-space supervision in the post-training process to better preserve high-frequency details. Experimentally, we show that adding a pixel-space objective significantly improves both supervised…
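
To make the abstract's argument concrete, here is a minimal sketch of why a loss computed only at latent resolution can miss high-frequency error, and how adding a pixel-space term exposes it. This is not the authors' implementation. The nearest-neighbour `toy_decode`, the plain mean-squared-error terms, the `pixel_weight` parameter, and every function name are assumptions made for illustration. A real LDM would decode with a learned VAE decoder and backpropagate the pixel term through it.

```python
# Illustrative sketch only: a toy stand-in for an LDM decoder and a combined
# latent + pixel objective. All names and the weighting are assumptions.

DOWNSAMPLE = 8  # per-side spatial factor, as stated in the abstract


def latent_shape(height, width, factor=DOWNSAMPLE):
    """Spatial size of the latent grid for an image of the given size."""
    return height // factor, width // factor


def toy_decode(latent, factor=DOWNSAMPLE):
    """Hypothetical decoder: nearest-neighbour upsampling of a 2-D grid.

    A real LDM would use a learned VAE decoder here.
    """
    pixels = []
    for row in latent:
        # Repeat every value `factor` times horizontally...
        wide = [value for value in row for _ in range(factor)]
        # ...and repeat the widened row `factor` times vertically.
        pixels.extend([list(wide) for _ in range(factor)])
    return pixels


def mse(a, b):
    """Mean squared error between two equally sized 2-D grids."""
    total, count = 0.0, 0
    for row_a, row_b in zip(a, b):
        for x, y in zip(row_a, row_b):
            total += (x - y) ** 2
            count += 1
    return total / count


def combined_loss(pred_latent, target_latent, target_pixels, pixel_weight=1.0):
    """Latent-space loss plus a pixel-space term on the decoded prediction."""
    latent_term = mse(pred_latent, target_latent)
    pixel_term = mse(toy_decode(pred_latent), target_pixels)
    return latent_term + pixel_weight * pixel_term


if __name__ == "__main__":
    # A 1024x1024 image corresponds to a 128x128 latent grid (64x fewer positions).
    print(latent_shape(1024, 1024))

    # One latent cell covers an 8x8 pixel patch. The target patch is a
    # checkerboard whose mean equals the latent value, so the latent term is
    # zero while the pixel term still registers the missing detail.
    latent = [[0.5]]
    checkerboard = [
        [0.5 + (0.25 if (r + c) % 2 == 0 else -0.25) for c in range(DOWNSAMPLE)]
        for r in range(DOWNSAMPLE)
    ]
    print(mse(latent, latent), combined_loss(latent, latent, checkerboard))
```

In this toy setting the prediction matches the target latent exactly, yet the combined loss is non-zero because the decoded patch lacks the checkerboard detail. That is the kind of error a latent-only objective cannot see.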

Taxonomy

Topics: Generative Adversarial Networks and Image Synthesis · Speech Recognition and Synthesis

Methods: Convolution · Concatenated Skip Connection · Diffusion · Max Pooling · U-Net