VFM-VAE: Vision Foundation Models Can Be Good Tokenizers for Latent Diffusion Models

Tianci Bi; Xiaoyi Zhang; Yan Lu; Nanning Zheng

arXiv:2510.18457·cs.CV·April 24, 2026

TL;DR

This paper introduces VFM-VAE, a new visual tokenizer leveraging frozen Vision Foundation Models for Latent Diffusion Models, achieving faster training and superior image generation quality.

Contribution

The paper proposes a tokenizer built directly on a frozen VFM and paired with a new decoder, improving efficiency and performance without distillation. It also systematically studies how different tokenizers affect representation learning during diffusion training.

Findings

01

VFM-VAE reaches a gFID of 2.22 in 80 epochs, a 10x training speedup over previous methods.

02

Extending training to 640 epochs improves gFID to 1.62.

03

The approach demonstrates the potential of VFMs as effective visual tokenizers.

Abstract

The performance of Latent Diffusion Models (LDMs) is critically dependent on the quality of their visual tokenizers. While recent works have explored incorporating Vision Foundation Models (VFMs) into tokenizer training via distillation, we empirically find that this approach inevitably weakens the robustness of the representation learnt from the original VFM. In this paper, we bypass distillation with a more direct approach that leverages the frozen VFM for the LDM tokenizer, named VFM Variational Autoencoder (VFM-VAE). To fully exploit the frozen VFM, we design a new decoder to reconstruct realistic images from the semantic-rich representation of the VFM. With the proposed VFM-VAE, we conduct a systematic study of how the representations from different tokenizers impact the representation learning process throughout diffusion training,…
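The core idea in the abstract, keeping the encoder frozen and training only a decoder to reconstruct images from its features, can be illustrated with a toy sketch. This is not the VFM-VAE architecture: the "encoder" here is a fixed random projection standing in for a pretrained VFM, the decoder is a single linear map, the loss is plain squared error, and there is no variational bottleneck. All names, sizes, and hyperparameters are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in data (hypothetical): 64 "images" flattened to 32 values each.
images = rng.normal(size=(64, 32))

# "Frozen encoder": a fixed random projection plus tanh, standing in for a
# pretrained VFM. It is never updated below.
frozen_weights = rng.normal(size=(32, 16)) / np.sqrt(32)
frozen_snapshot = frozen_weights.copy()

def encode(x):
    """Map inputs to features using the frozen weights only."""
    return np.tanh(x @ frozen_weights)

# Trainable decoder: a single linear map from features back to pixel space.
decoder_weights = np.zeros((16, 32))

def reconstruction_loss(x):
    """Mean over samples of the squared reconstruction error."""
    residual = encode(x) @ decoder_weights - x
    return float(np.mean(np.sum(residual ** 2, axis=1)))

initial_loss = reconstruction_loss(images)

learning_rate = 0.01
for _ in range(500):
    features = encode(images)
    residual = features @ decoder_weights - images
    # Gradient of the loss with respect to the decoder weights only;
    # no gradient is ever applied to frozen_weights.
    grad = 2.0 * features.T @ residual / len(images)
    decoder_weights -= learning_rate * grad

final_loss = reconstruction_loss(images)
encoder_unchanged = np.array_equal(frozen_weights, frozen_snapshot)

print(f"loss before: {initial_loss:.3f}, after: {final_loss:.3f}")
print(f"encoder unchanged: {encoder_unchanged}")
```

In this sketch the reconstruction error falls while the encoder weights stay bit-for-bit identical, which is the property a frozen-encoder tokenizer relies on. The paper's actual decoder design and training objective are described in the full text, not here.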

Code & Models

Repositories

tiancib/VFM-VAE
github

Models

🤗
tiancibi/VFM-VAE
model · ♡ 2
