UniTok: A Unified Tokenizer for Visual Generation and Understanding

Chuofan Ma; Yi Jiang; Junfeng Wu; Jihan Yang; Xin Yu; Zehuan Yuan; Bingyue Peng; Xiaojuan Qi

arXiv:2502.20321 · cs.CV · October 27, 2025

TL;DR

UniTok is a unified visual tokenizer built on multi-codebook quantization. A single tokenizer serves both visual generation and understanding, reaches 0.38 rFID and 78.6% zero-shot ImageNet accuracy, and integrates into multimodal models without performance trade-offs.

Contribution

The paper proposes UniTok, a unified tokenizer whose multi-codebook quantization overcomes the limited representational capacity of the discrete token space, so that one tokenizer can serve both visual generation and understanding.

Findings

01 Sets a new record of 0.38 rFID and 78.6% zero-shot accuracy on ImageNet.

02 Reduces gFID from 14.6 to 2.5 on ImageNet 256×256.

03 Enables seamless integration into multimodal models for visual tasks.

Abstract

Visual generative and understanding models typically rely on distinct tokenizers to process images, presenting a key challenge for unifying them within a single framework. Recent studies attempt to address this by connecting the training of VQVAE (for autoregressive generation) and CLIP (for understanding) to build a unified tokenizer. However, directly combining these training objectives has been observed to cause severe loss conflicts. In this paper, we show that reconstruction and semantic supervision do not inherently conflict. Instead, the underlying bottleneck stems from limited representational capacity of discrete token space. Building on these insights, we introduce UniTok, a unified tokenizer featuring a novel multi-codebook quantization mechanism that effectively scales up the vocabulary size and bottleneck dimension. In terms of final performance, UniTok sets a new record of…
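
The idea of multi-codebook quantization mentioned in the abstract can be illustrated with a minimal sketch. It assumes the latent vector is split into equal chunks and each chunk is matched to the nearest entry of its own sub-codebook. The function names, the sizes (4 sub-codebooks of 16 entries, 8-dimensional latent), and the randomly initialized codebooks are all hypothetical choices for illustration, not the paper's settings, and no training losses are shown.

```python
import random


def nearest(codebook, chunk):
    """Return the index of the codebook entry closest to chunk (squared L2)."""
    best_idx, best_dist = 0, float("inf")
    for idx, entry in enumerate(codebook):
        dist = sum((a - b) ** 2 for a, b in zip(entry, chunk))
        if dist < best_dist:
            best_idx, best_dist = idx, dist
    return best_idx


def multi_codebook_quantize(latent, codebooks):
    """Split latent into len(codebooks) equal chunks and quantize each chunk
    with its own sub-codebook. Returns (indices, quantized_vector)."""
    n = len(codebooks)
    if len(latent) % n != 0:
        raise ValueError("latent dimension must be divisible by the number of codebooks")
    chunk_dim = len(latent) // n
    indices, quantized = [], []
    for i, codebook in enumerate(codebooks):
        chunk = latent[i * chunk_dim:(i + 1) * chunk_dim]
        idx = nearest(codebook, chunk)
        indices.append(idx)
        quantized.extend(codebook[idx])
    return indices, quantized


def make_codebooks(num_codebooks, codebook_size, chunk_dim, seed=0):
    """Hypothetical helper: random codebooks standing in for learned ones."""
    rng = random.Random(seed)
    return [
        [[rng.gauss(0.0, 1.0) for _ in range(chunk_dim)] for _ in range(codebook_size)]
        for _ in range(num_codebooks)
    ]


# Illustrative sizes only (assumptions, not taken from the paper).
NUM_CODEBOOKS, CODEBOOK_SIZE, LATENT_DIM = 4, 16, 8
codebooks = make_codebooks(NUM_CODEBOOKS, CODEBOOK_SIZE, LATENT_DIM // NUM_CODEBOOKS)
latent = [0.1 * k for k in range(LATENT_DIM)]
indices, quantized = multi_codebook_quantize(latent, codebooks)

# Each token is now a tuple of sub-codebook indices, so the number of
# distinct tokens is CODEBOOK_SIZE ** NUM_CODEBOOKS.
effective_vocab = CODEBOOK_SIZE ** NUM_CODEBOOKS
```

Under these assumptions, the number of distinct index combinations grows exponentially with the number of sub-codebooks while each nearest-neighbor lookup stays small. This is one way a split-and-quantize scheme can scale up the vocabulary size and bottleneck dimension; the official repository should be consulted for the actual design.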

Code & Models

Repositories

foundationvision/unitok
PyTorch · Official

Taxonomy

Topics: Generative Adversarial Networks and Image Synthesis · Domain Adaptation and Few-Shot Learning · Multimodal Machine Learning Applications

Methods: Contrastive Language-Image Pre-training