Latent Diffusion Models with Masked AutoEncoders
Junho Lee, Jeongwoo Shin, Hyungwook Choi, Joonseok Lee

TL;DR
This paper investigates the properties of autoencoders in Latent Diffusion Models, identifies key limitations, and proposes Variational Masked AutoEncoders (VMAEs) to improve image generation quality.
Contribution
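It introduces VMAEs, which build on the hierarchical features maintained by Masked AutoEncoders to improve the autoencoder used in LDMs. Existing autoencoders fail to satisfy latent smoothness, perceptual compression quality, and reconstruction quality at the same time; VMAEs are designed to address that gap.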
Findings
VMAEs improve latent smoothness and perceptual compression quality.
Integration of VMAEs enhances image generation performance.
The proposed framework outperforms existing autoencoder designs in LDMs.
Abstract
In spite of the remarkable potential of Latent Diffusion Models (LDMs) in image generation, the desired properties and optimal design of the autoencoders have been underexplored. In this work, we analyze the role of autoencoders in LDMs and identify three key properties: latent smoothness, perceptual compression quality, and reconstruction quality. We demonstrate that existing autoencoders fail to simultaneously satisfy all three properties, and propose Variational Masked AutoEncoders (VMAEs), taking advantage of the hierarchical features maintained by Masked AutoEncoders. We integrate VMAEs into the LDM framework, introducing Latent Diffusion Models with Masked AutoEncoders (LDMAEs). Our code is available at https://github.com/isno0907/ldmae.
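The name "Variational Masked AutoEncoder" suggests two ingredients: MAE-style random patch masking and a variational (Gaussian) latent regularized with a KL term. The sketch below illustrates only those two generic ingredients with NumPy. It is not the authors' implementation (see the linked repository for that); the patch count, mask ratio, latent size, and the linear stand-in "encoder" are all hypothetical choices made for illustration.

```python
import numpy as np


def random_patch_mask(num_patches, mask_ratio, rng):
    """Return a boolean mask (True = masked) hiding a fixed fraction of patches."""
    num_masked = int(round(num_patches * mask_ratio))
    perm = rng.permutation(num_patches)
    mask = np.zeros(num_patches, dtype=bool)
    mask[perm[:num_masked]] = True
    return mask


def reparameterize(mu, logvar, rng):
    """Sample z = mu + sigma * eps with eps ~ N(0, I)."""
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * logvar) * eps


def kl_to_standard_normal(mu, logvar):
    """KL(N(mu, sigma^2) || N(0, I)), summed over latent dims, averaged over tokens."""
    per_token = 0.5 * np.sum(mu**2 + np.exp(logvar) - 1.0 - logvar, axis=-1)
    return float(np.mean(per_token))


rng = np.random.default_rng(0)

# Hypothetical sizes: 16 flattened patches of 48 values, 8-dim latent per patch.
patches = rng.standard_normal((16, 48))
mask = random_patch_mask(num_patches=16, mask_ratio=0.75, rng=rng)
visible = patches[~mask]  # in MAE-style training the encoder sees only these

# Stand-in "encoder": two fixed linear maps producing mean and log-variance.
W_mu = rng.standard_normal((48, 8)) * 0.1
W_logvar = rng.standard_normal((48, 8)) * 0.1
mu = visible @ W_mu
logvar = visible @ W_logvar

z = reparameterize(mu, logvar, rng)   # latent tokens a diffusion model would operate on
kl = kl_to_standard_normal(mu, logvar)  # regularizer pulling latents toward N(0, I)
```

In a real model the linear maps would be replaced by a trained encoder, and a decoder would reconstruct the masked patches from `z`; the KL term is one common way to encourage a smoother latent space, which is one of the three properties the abstract names.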
Taxonomy
Topics: Topic Modeling
