Towards Scalable Pre-training of Visual Tokenizers for Generation
Jingfeng Yao, Yuda Song, Yucong Zhou, Xinggang Wang

TL;DR
This paper introduces VTP, a unified pre-training framework for visual tokenizers that improves their high-level semantic representation, leading to better generative performance and scalability in vision models.
Contribution
The paper proposes VTP, a novel joint optimization approach for visual tokenizers, addressing the pre-training scaling problem and enhancing generative quality and scalability.
Findings
Understanding semantics boosts generation quality.
VTP scales effectively with compute, data, and parameters.
Pre-trained VTP models outperform traditional autoencoders in downstream tasks.
Abstract
The quality of the latent space in visual tokenizers (e.g., VAEs) is crucial for modern generative models. However, the standard reconstruction-based training paradigm produces a latent space that is biased towards low-level information, leading to a foundational flaw: better pixel-level accuracy does not lead to higher-quality generation. This implies that pouring extensive compute into visual tokenizer pre-training translates poorly to improved performance in generation. We identify this as the "pre-training scaling problem" and suggest a necessary shift: to be effective for generation, a latent space must concisely represent high-level semantics. We present VTP, a unified visual tokenizer pre-training framework, pioneering the joint optimization of image-text contrastive, self-supervised, and reconstruction losses. Our large-scale study reveals two principal findings: (1)…
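The joint objective named in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names (`info_nce`, `mse`, `joint_loss`), the loss weights, the temperature, and the use of a simple feature-matching term as a stand-in for the self-supervised loss are all assumptions made for illustration. The sketch only shows how three loss terms could be combined into one weighted sum.

```python
import math


def info_nce(image_embs, text_embs, temperature=0.07):
    """Symmetric image-text contrastive loss over a batch of paired embeddings.

    Assumes image_embs[i] and text_embs[i] form a matching pair.
    """
    def normalize(v):
        norm = math.sqrt(sum(x * x for x in v)) or 1.0
        return [x / norm for x in v]

    imgs = [normalize(v) for v in image_embs]
    txts = [normalize(v) for v in text_embs]
    # Cosine-similarity logits: row i compares image i against every text.
    logits = [[sum(a * b for a, b in zip(i, t)) / temperature for t in txts]
              for i in imgs]

    def cross_entropy_diag(rows):
        # Cross-entropy where the correct class for row k is column k.
        total = 0.0
        for k, row in enumerate(rows):
            m = max(row)
            log_sum = m + math.log(sum(math.exp(x - m) for x in row))
            total += log_sum - row[k]
        return total / len(rows)

    transposed = [list(col) for col in zip(*logits)]
    return 0.5 * (cross_entropy_diag(logits) + cross_entropy_diag(transposed))


def mse(pred, target):
    """Mean squared error between two equal-length flat lists."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def joint_loss(recon, pixels, image_embs, text_embs,
               student_feats, teacher_feats,
               w_rec=1.0, w_clip=1.0, w_ssl=1.0):
    """Weighted sum of reconstruction, contrastive, and self-supervised terms.

    The weights and the feature-matching SSL stand-in are hypothetical.
    """
    l_rec = mse(recon, pixels)                  # pixel-level reconstruction
    l_clip = info_nce(image_embs, text_embs)    # image-text contrastive
    l_ssl = mse(student_feats, teacher_feats)   # placeholder for an SSL loss
    return w_rec * l_rec + w_clip * l_clip + w_ssl * l_ssl
```

In this sketch, setting `w_clip` and `w_ssl` to zero recovers a purely reconstruction-driven objective, the setting the abstract argues scales poorly for generation.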
Taxonomy
Topics: Generative Adversarial Networks and Image Synthesis · Multimodal Machine Learning Applications · Image Enhancement Techniques
