Vision Foundation Models as Generalist Tokenizers for Image Generation

Anlin Zheng; Qi Han; Xin Wen; Chuofan Ma; Lanxi Gong; Gang Yu; Xiangyu Zhang; Xiaojuan Qi

arXiv:2605.18390 · cs.CV · May 19, 2026

TL;DR

This paper introduces VFMTok, an image tokenizer built on a frozen vision foundation model. It reaches state-of-the-art synthesis quality with improved token efficiency in both discrete and continuous latent spaces for image generation.

Contribution

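The work presents a region-adaptive quantization framework, which removes spatial redundancy in standard 2D grid features, and a semantic reconstruction objective, which aligns the decoded outputs with the VFM's representations. Together they let VFMTok operate as a generalist visual tokenizer with improved synthesis quality and token efficiency.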

Findings

1. VFMTok accelerates autoregressive model convergence by 3x.

2. Achieves a state-of-the-art gFID of 1.36 on ImageNet.

3. Enables high-fidelity class-conditional synthesis without classifier-free guidance.

Abstract

In this work, we investigate the largely unexplored direction of building a generalist image tokenizer directly on top of a frozen vision foundation model (VFM). To build this tokenizer, we use a frozen VFM as the encoder and introduce two key innovations: (1) a region-adaptive quantization framework to eliminate spatial redundancy in standard 2D grid features, and (2) a semantic reconstruction objective that aligns the decoded outputs with the VFM's representations to preserve semantic fidelity. Grounded in these designs, we propose VFMTok, a generalist visual tokenizer capable of operating seamlessly in both discrete and continuous latent spaces. VFMTok achieves substantial improvements in synthesis quality while drastically enhancing token efficiency. For discrete autoregressive (AR) generation, it accelerates model convergence by 3 times and achieves a state-of-the-art gFID…
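
The two ingredients named in the abstract can be illustrated with a toy sketch. The abstract does not give the exact form of either component, so everything below is an assumption: `semantic_reconstruction_loss` uses a mean cosine distance between decoded token features and frozen-VFM features as one plausible alignment objective, and `quantize` is plain nearest-neighbour vector quantization, not the paper's region-adaptive scheme. All function names are hypothetical.

```python
import math


def cosine_similarity(a, b, eps=1e-12):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    # eps guards against division by zero for all-zero vectors.
    return dot / (norm_a * norm_b + eps)


def semantic_reconstruction_loss(decoded_feats, vfm_feats):
    """Mean (1 - cosine similarity) over paired token features.

    Assumption: the alignment objective is a cosine distance. The
    abstract only says decoded outputs are aligned with the VFM's
    representations; it does not specify the loss.
    """
    if len(decoded_feats) != len(vfm_feats):
        raise ValueError("feature lists must have the same length")
    losses = [
        1.0 - cosine_similarity(d, v)
        for d, v in zip(decoded_feats, vfm_feats)
    ]
    return sum(losses) / len(losses)


def quantize(feature, codebook):
    """Index of the codebook entry nearest to `feature`.

    Plain nearest-neighbour lookup by squared Euclidean distance,
    shown only to illustrate a discrete latent space.
    """
    def sq_dist(entry):
        return sum((f - e) ** 2 for f, e in zip(feature, entry))

    return min(range(len(codebook)), key=lambda i: sq_dist(codebook[i]))


if __name__ == "__main__":
    decoded = [[1.0, 0.0], [0.0, 1.0]]
    target = [[2.0, 0.0], [1.0, 0.0]]
    print(semantic_reconstruction_loss(decoded, target))
    print(quantize([0.9, 0.1], [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]))
```

In this sketch the loss is zero when decoded features point in the same direction as the VFM features regardless of magnitude, which is one common reason to prefer a cosine distance over a squared error for feature alignment.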
