Vision Foundation Models as Effective Visual Tokenizers for Autoregressive Image Generation

Anlin Zheng; Xin Wen; Xuanyang Zhang; Chuofan Ma; Tiancai Wang; Gang Yu; Xiangyu Zhang; Xiaojuan Qi

arXiv:2507.08441·cs.CV·October 28, 2025

TL;DR

This paper introduces VFMTok, a novel image tokenizer built on a frozen vision foundation model, improving image generation quality, efficiency, and convergence speed through region-adaptive quantization and semantic alignment.

Contribution

It builds an image tokenizer on a frozen vision foundation model used as the encoder, adding a region-adaptive quantization framework and a semantic reconstruction objective.

Findings

01

Achieves a gFID of 1.36 on ImageNet benchmarks.

02

Accelerates autoregressive model convergence by three times.

03

Enables high-fidelity class-conditional image synthesis without classifier-free guidance.

Abstract

In this work, we present a novel direction to build an image tokenizer directly on top of a frozen vision foundation model, which is a largely underexplored area. Specifically, we employ a frozen vision foundation model as the encoder of our tokenizer. To enhance its effectiveness, we introduce two key components: (1) a region-adaptive quantization framework that reduces redundancy in the pre-trained features on regular 2D grids, and (2) a semantic reconstruction objective that aligns the tokenizer's outputs with the foundation model's representations to preserve semantic fidelity. Based on these designs, our proposed image tokenizer, VFMTok, achieves substantial improvements in image reconstruction and generation quality, while also enhancing token efficiency. It further boosts autoregressive (AR) generation -- achieving a gFID of 1.36 on ImageNet benchmarks, while accelerating model…
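The two ingredients named in the abstract, quantizing features from a frozen encoder and aligning the tokenizer's outputs with those features, can be illustrated with a toy sketch. This is not the VFMTok implementation. It swaps in plain nearest-neighbor codebook quantization for the paper's region-adaptive scheme and assumes a cosine-similarity form for the alignment loss, which the abstract does not specify. All function names and the tiny 2-D "features" are hypothetical.

```python
import math

def nearest_code(vec, codebook):
    """Return the index of the codebook entry closest to vec (squared L2)."""
    best_idx, best_dist = 0, float("inf")
    for idx, code in enumerate(codebook):
        dist = sum((v - c) ** 2 for v, c in zip(vec, code))
        if dist < best_dist:
            best_idx, best_dist = idx, dist
    return best_idx

def quantize(features, codebook):
    """Map each feature vector to a discrete token id.

    Assumption: plain vector quantization, standing in for the
    region-adaptive quantization described in the abstract.
    """
    return [nearest_code(f, codebook) for f in features]

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b + 1e-8)

def semantic_alignment_loss(reconstructed, frozen):
    """One minus mean cosine similarity between reconstructed and frozen features.

    Assumption: the cosine form is illustrative only.
    """
    sims = [cosine_similarity(r, f) for r, f in zip(reconstructed, frozen)]
    return 1.0 - sum(sims) / len(sims)

# Hypothetical stand-ins for frozen-encoder features and a learned codebook.
codebook = [[0.0, 0.0], [1.0, 1.0], [4.0, 0.0]]
frozen_features = [[0.9, 1.2], [3.5, 0.1]]

tokens = quantize(frozen_features, codebook)
reconstructed = [codebook[t] for t in tokens]
loss = semantic_alignment_loss(reconstructed, frozen_features)
print(tokens, round(loss, 4))
```

In this toy setup the loss is small because each chosen codebook entry points in nearly the same direction as the feature it replaces. A trained tokenizer would minimize such a term alongside an image reconstruction loss.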
