Vision Foundation Models as Generalist Tokenizers for Image Generation
Anlin Zheng, Qi Han, Xin Wen, Chuofan Ma, Lanxi Gong, Gang Yu, Xiangyu Zhang, Xiaojuan Qi

TL;DR
This paper introduces VFMTok, a novel image tokenizer built on frozen vision foundation models, achieving state-of-the-art synthesis quality and efficiency in both discrete and continuous latent spaces for image generation.
Contribution
The work presents a region-adaptive quantization framework and a semantic reconstruction objective, enabling VFMTok to operate as a generalist visual tokenizer with substantial gains in synthesis quality and token efficiency.
Findings
Accelerates convergence of discrete autoregressive models by 3x.
Achieves a state-of-the-art gFID of 1.36 on ImageNet.
Enables high-fidelity class-conditional synthesis without classifier-free guidance.
Abstract
In this work, we explore the largely unexplored direction of building a generalist image tokenizer directly on top of a frozen vision foundation model (VFM). To build this tokenizer, we utilize a frozen VFM as the encoder and introduce two key innovations: (1) a region-adaptive quantization framework to eliminate spatial redundancy in standard 2D grid features, and (2) a semantic reconstruction objective that aligns the decoded outputs with the VFM's representations to preserve semantic fidelity. Grounded in these designs, we propose VFMTok, a generalist visual tokenizer capable of operating seamlessly in both discrete and continuous latent spaces. VFMTok achieves substantial improvements in synthesis quality while drastically enhancing token efficiency. For discrete autoregressive (AR) generation, it accelerates model convergence by 3 times and achieves a state-of-the-art gFID…
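The semantic reconstruction objective mentioned in the abstract can be illustrated with a small sketch. The abstract only states that decoded outputs are aligned with the frozen VFM's representations; it does not give the loss form. The cosine-distance formulation below, the function name semantic_reconstruction_loss, and the (num_tokens, dim) feature layout are all assumptions made for illustration, not the authors' implementation.

```python
import numpy as np


def semantic_reconstruction_loss(decoded_feats, vfm_feats, eps=1e-8):
    """Mean cosine distance between decoded features and frozen-VFM features.

    ASSUMPTION: cosine distance is a stand-in for the paper's alignment
    objective, whose exact form is not given in the abstract.
    Both inputs have shape (num_tokens, dim).
    """
    decoded = np.asarray(decoded_feats, dtype=np.float64)
    target = np.asarray(vfm_feats, dtype=np.float64)
    if decoded.shape != target.shape:
        raise ValueError("decoded and VFM feature shapes must match")

    # Per-token cosine similarity, guarded against zero-norm vectors.
    dot = np.sum(decoded * target, axis=-1)
    norms = np.linalg.norm(decoded, axis=-1) * np.linalg.norm(target, axis=-1)
    cos = dot / np.maximum(norms, eps)

    # 0 when features point the same way, 2 when they are opposite.
    return float(np.mean(1.0 - cos))
```

In a training loop the VFM features would be treated as a fixed target (the encoder is frozen), so only the tokenizer's trainable parts would receive gradients from a term like this.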
