TokenFlow: Unified Image Tokenizer for Multimodal Understanding and Generation

Liao Qu; Huichao Zhang; Yiheng Liu; Xu Wang; Yi Jiang; Yiming Gao; Hu Ye; Daniel K. Du; Zehuan Yuan; Xinglong Wu

arXiv:2412.03069 · cs.CV · August 8, 2025

TL;DR

TokenFlow introduces a dual-codebook image tokenizer that decouples semantic and pixel-level feature learning while keeping the two aligned through shared indices, improving both multimodal understanding and image synthesis quality.

Contribution

TokenFlow proposes a dual-codebook architecture that decouples semantic and pixel-level features within a unified framework, improving both understanding and generation.

Findings

01 Outperforms LLaVA-1.5 13B on understanding tasks by 7.2%

02 Achieves an FID score of 0.63 at 384x384 resolution for image reconstruction

03 Sets a new state of the art in autoregressive image generation, with a GenEval score of 0.55 at 256x256

Abstract

We present TokenFlow, a novel unified image tokenizer that bridges the long-standing gap between multimodal understanding and generation. Prior research attempts to employ a single reconstruction-targeted Vector Quantization (VQ) encoder to unify these two tasks. We observe that understanding and generation require fundamentally different granularities of visual information. This leads to a critical trade-off, particularly compromising performance in multimodal understanding tasks. TokenFlow addresses this challenge through an innovative dual-codebook architecture that decouples semantic and pixel-level feature learning while maintaining their alignment via a shared mapping mechanism. Through shared indices, this design enables direct access to both the high-level semantic representations crucial for understanding tasks and the fine-grained visual features essential for generation. Our…
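The idea of two codebooks addressed by one shared index can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: it assumes the shared index is chosen by minimizing a weighted sum of the semantic and pixel-level distances, and the function names, the `pix_weight` parameter, and the toy codebooks are all hypothetical.

```python
import numpy as np


def pairwise_sq_dist(x, codebook):
    """Squared Euclidean distance from each row of x (n, d) to each code (k, d)."""
    diff = x[:, None, :] - codebook[None, :, :]
    return (diff ** 2).sum(axis=-1)  # shape (n, k)


def shared_index_quantize(sem_feats, pix_feats, sem_codebook, pix_codebook,
                          pix_weight=1.0):
    """Quantize semantic and pixel features with ONE index per position.

    Assumption (not stated in the abstract): the index minimizes
    d_sem + pix_weight * d_pix, so entry i of both codebooks stays paired.
    """
    d_sem = pairwise_sq_dist(sem_feats, sem_codebook)
    d_pix = pairwise_sq_dist(pix_feats, pix_codebook)
    idx = np.argmin(d_sem + pix_weight * d_pix, axis=1)
    # The same index retrieves a semantic code and a pixel-level code.
    return idx, sem_codebook[idx], pix_codebook[idx]


# Toy codebooks: 2 entries each; feature widths may differ between branches.
sem_codebook = np.array([[0.0, 0.0], [10.0, 10.0]])
pix_codebook = np.array([[0.0, 0.0, 0.0], [1.0, 1.0, 1.0]])

# Two spatial positions, each with a semantic and a pixel-level feature.
sem_feats = np.array([[9.0, 9.0], [0.5, -0.5]])
pix_feats = np.array([[0.9, 0.9, 0.9], [0.1, 0.0, 0.0]])

idx, sem_q, pix_q = shared_index_quantize(
    sem_feats, pix_feats, sem_codebook, pix_codebook
)
print(idx, sem_q.shape, pix_q.shape)
```

Under this assumption, a downstream understanding model could read `sem_q` while an image decoder reads `pix_q`, with both looked up from the same discrete token sequence `idx`.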

Code & Models

Models

ruohguo/avis (Hugging Face model)

Taxonomy

Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Image Retrieval and Classification Techniques