TL;DR
This paper introduces VFM-VAE, a visual tokenizer for Latent Diffusion Models (LDMs) built on a frozen Vision Foundation Model (VFM). It trains faster and yields higher image generation quality than previous tokenizers.
Contribution
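The paper proposes a tokenizer that uses a frozen VFM directly, without distillation, paired with a new decoder that reconstructs realistic images from the VFM's semantic-rich representation. It also systematically studies how representations from different tokenizers affect diffusion training.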
Findings
VFM-VAE reaches a gFID of 2.22 in 80 epochs, a 10x training speedup over previous methods.
Extending training to 640 epochs improves gFID to 1.62.
The approach demonstrates the potential of VFMs as effective visual tokenizers.
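For context on the metric: gFID is read here as generation FID, the Fréchet Inception Distance between generated and real images. That reading is an assumption, since this summary does not define the term. Under it, the standard FID definition applies, where the symbols below are the means and covariances of Inception features for real (r) and generated (g) images, and lower values are better:

```latex
\mathrm{FID} = \lVert \mu_r - \mu_g \rVert_2^2
  + \operatorname{Tr}\!\left( \Sigma_r + \Sigma_g - 2\,(\Sigma_r \Sigma_g)^{1/2} \right)
```

On this reading, the drop from 2.22 to 1.62 means the feature statistics of generated images moved closer to those of real images with longer training.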
Abstract
The performance of Latent Diffusion Models (LDMs) is critically dependent on the quality of their visual tokenizers. While recent works have explored incorporating Vision Foundation Models (VFMs) into tokenizer training via distillation, we empirically find that this approach inevitably weakens the robustness of the representation learnt from the original VFM. In this paper, we bypass distillation by proposing a more direct approach that leverages the frozen VFM for the LDM tokenizer, named VFM Variational Autoencoder (VFM-VAE). To fully exploit the potential of the frozen VFM for the LDM tokenizer, we design a new decoder to reconstruct realistic images from the semantic-rich representation of the VFM. With the proposed VFM-VAE, we conduct a systematic study on how the representations from different tokenizers impact the representation learning process throughout diffusion training,…
