Towards Scalable Pre-training of Visual Tokenizers for Generation

Jingfeng Yao; Yuda Song; Yucong Zhou; Xinggang Wang

arXiv:2512.13687·cs.CV·March 9, 2026

TL;DR

This paper introduces VTP, a unified pre-training framework for visual tokenizers that jointly optimizes image-text contrastive, self-supervised, and reconstruction losses so that the latent space captures high-level semantics, which in turn improves generation quality and scaling behavior.

Contribution

The paper identifies the "pre-training scaling problem" (more compute spent on reconstruction-only tokenizer pre-training translates poorly into better generation) and proposes VTP, a joint optimization framework for visual tokenizers designed to address it.

Findings

01. Understanding semantics boosts generation quality.

02. VTP scales effectively with compute, data, and parameters.

03. Pre-trained VTP models outperform traditional autoencoders in downstream tasks.

Abstract

The quality of the latent space in visual tokenizers (e.g., VAEs) is crucial for modern generative models. However, the standard reconstruction-based training paradigm produces a latent space that is biased towards low-level information, leading to a foundational flaw: better pixel-level accuracy does not lead to higher-quality generation. This implies that pouring extensive compute into visual tokenizer pre-training translates poorly into improved generation performance. We identify this as the "pre-training scaling problem" and suggest a necessary shift: to be effective for generation, a latent space must concisely represent high-level semantics. We present VTP, a unified visual tokenizer pre-training framework, pioneering the joint optimization of image-text contrastive, self-supervised, and reconstruction losses. Our large-scale study reveals two principal findings: (1)…
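
To make the idea of jointly optimizing three objectives concrete, here is a minimal sketch in plain Python. It is not the authors' implementation: the weighted-sum form, the weights `w_rec`, `w_clip`, and `w_ssl`, the temperature of 0.07, and all function names are assumptions chosen for illustration. In particular, the self-supervised term is a simple student-teacher cosine distance used as a stand-in, since the abstract does not specify that loss.

```python
import math

def normalize(v):
    """Scale a vector to unit L2 norm (zero vectors are returned unchanged)."""
    n = math.sqrt(sum(a * a for a in v)) or 1.0
    return [a / n for a in v]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def mse(x, y):
    """Reconstruction term: mean squared error between two flat lists."""
    return sum((a - b) ** 2 for a, b in zip(x, y)) / len(x)

def cross_entropy_row(logits, target):
    """Cross-entropy for one row of logits, using a stable log-sum-exp."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(l - m) for l in logits))
    return lse - logits[target]

def contrastive_loss(img_emb, txt_emb, temperature=0.07):
    """Symmetric InfoNCE over a batch; the k-th image matches the k-th text."""
    img = [normalize(v) for v in img_emb]
    txt = [normalize(v) for v in txt_emb]
    n = len(img)
    logits = [[dot(i, t) / temperature for t in txt] for i in img]
    # Image-to-text direction: rows of the similarity matrix.
    i2t = sum(cross_entropy_row(logits[k], k) for k in range(n)) / n
    # Text-to-image direction: columns of the similarity matrix.
    t2i = sum(
        cross_entropy_row([logits[r][k] for r in range(n)], k) for k in range(n)
    ) / n
    return 0.5 * (i2t + t2i)

def ssl_loss(student, teacher):
    """Stand-in self-supervised term (assumption): cosine distance."""
    return 1.0 - dot(normalize(student), normalize(teacher))

def joint_loss(recon, target, img_emb, txt_emb, student, teacher,
               w_rec=1.0, w_clip=1.0, w_ssl=1.0):
    """Weighted sum of the three terms; the weights are hypothetical."""
    return (
        w_rec * mse(recon, target)
        + w_clip * contrastive_loss(img_emb, txt_emb)
        + w_ssl * ssl_loss(student, teacher)
    )

if __name__ == "__main__":
    # Toy batch of two image-text pairs with 2-D embeddings.
    images = [[1.0, 0.0], [0.0, 1.0]]
    texts = [[0.9, 0.1], [0.2, 0.8]]
    total = joint_loss(
        recon=[0.1, 0.5, 0.9], target=[0.0, 0.5, 1.0],
        img_emb=images, txt_emb=texts,
        student=[0.3, 0.7], teacher=[0.25, 0.75],
    )
    print(f"joint loss: {total:.4f}")
```

In a real training loop the three terms would be computed from a shared tokenizer encoder and backpropagated together; the point of the sketch is only that the reconstruction term is one of several signals shaping the latent space rather than the sole objective.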

Taxonomy

Topics: Generative Adversarial Networks and Image Synthesis · Multimodal Machine Learning Applications · Image Enhancement Techniques