UniTok: A Unified Tokenizer for Visual Generation and Understanding
Chuofan Ma, Yi Jiang, Junfeng Wu, Jihan Yang, Xin Yu, Zehuan Yuan, Bingyue Peng, Xiaojuan Qi

TL;DR
UniTok introduces a multi-codebook quantization tokenizer that unifies visual generation and understanding, achieving state-of-the-art results and enabling seamless integration into multimodal models without performance trade-offs.
Contribution
The paper proposes UniTok, a unified tokenizer whose multi-codebook quantization overcomes the limited representational capacity of the discrete token space, so that a single tokenizer can serve both visual generation and understanding.
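To make the idea of multi-codebook quantization concrete, here is a minimal NumPy sketch. This page does not spell out the mechanism, so the sketch assumes one common reading: each latent vector is split into equal chunks, and each chunk is matched to the nearest entry of its own independent sub-codebook. The function name `multi_codebook_quantize`, the array layout, and the toy sizes are illustrative assumptions, not the authors' implementation or configuration.

```python
import numpy as np


def multi_codebook_quantize(z, codebooks):
    """Quantize latents chunk by chunk with independent sub-codebooks.

    z:         (n_tokens, dim) array of continuous latent vectors.
    codebooks: (n_books, book_size, dim // n_books) array (hypothetical layout).
    Returns (indices, z_q): indices has shape (n_tokens, n_books),
    z_q has the same shape as z.
    """
    n_tokens, dim = z.shape
    n_books, book_size, chunk_dim = codebooks.shape
    assert dim == n_books * chunk_dim, "latent dim must split evenly across codebooks"

    # Split every latent vector into n_books chunks.
    chunks = z.reshape(n_tokens, n_books, chunk_dim)

    # Squared Euclidean distance from each chunk to every entry of its own
    # sub-codebook: shape (n_tokens, n_books, book_size).
    diff = chunks[:, :, None, :] - codebooks[None, :, :, :]
    dists = (diff ** 2).sum(axis=-1)

    # Nearest entry per chunk gives one discrete index per sub-codebook.
    indices = dists.argmin(axis=-1)

    # Look up the selected entries and concatenate the chunks back together.
    book_ids = np.arange(n_books)[None, :]
    z_q = codebooks[book_ids, indices].reshape(n_tokens, dim)
    return indices, z_q


# Toy sizes chosen for illustration only.
rng = np.random.default_rng(0)
codebooks = rng.normal(size=(4, 16, 2))   # 4 sub-codebooks, 16 entries each
z = rng.normal(size=(5, 8))               # 5 tokens, 8-dim latents
indices, z_q = multi_codebook_quantize(z, codebooks)
print(indices.shape, z_q.shape)           # (5, 4) (5, 8)

# Each token is described by one index per sub-codebook, so the number of
# distinct combinations grows as book_size ** n_books.
effective_vocab = codebooks.shape[1] ** codebooks.shape[0]
```

Under these assumptions, four sub-codebooks of 16 entries each give 16^4 = 65,536 possible index combinations per token while storing only 64 code vectors, which illustrates how splitting the latent can enlarge the effective vocabulary without one very large codebook.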
Findings
Sets a new record of 0.38 rFID and 78.6% zero-shot accuracy on ImageNet.
Reduces gFID from 14.6 to 2.5 on ImageNet 256x256.
Enables seamless integration into multimodal models for visual tasks.
Abstract
Visual generative and understanding models typically rely on distinct tokenizers to process images, presenting a key challenge for unifying them within a single framework. Recent studies attempt to address this by connecting the training of VQVAE (for autoregressive generation) and CLIP (for understanding) to build a unified tokenizer. However, directly combining these training objectives has been observed to cause severe loss conflicts. In this paper, we show that reconstruction and semantic supervision do not inherently conflict. Instead, the underlying bottleneck stems from limited representational capacity of discrete token space. Building on these insights, we introduce UniTok, a unified tokenizer featuring a novel multi-codebook quantization mechanism that effectively scales up the vocabulary size and bottleneck dimension. In terms of final performance, UniTok sets a new record of…
Taxonomy
Topics: Generative Adversarial Networks and Image Synthesis · Domain Adaptation and Few-Shot Learning · Multimodal Machine Learning Applications
Methods: Contrastive Language-Image Pre-training
