TokenFlow: Unified Image Tokenizer for Multimodal Understanding and Generation
Liao Qu, Huichao Zhang, Yiheng Liu, Xu Wang, Yi Jiang, Yiming Gao, Hu Ye, Daniel K. Du, Zehuan Yuan, Xinglong Wu

TL;DR
TokenFlow introduces a dual-codebook image tokenizer that decouples semantic features from pixel-level features while keeping the two aligned through shared indices, improving both multimodal understanding and image generation.
Contribution
It proposes a novel dual-codebook architecture that decouples semantic and pixel features, enhancing both understanding and generation capabilities in a unified framework.
Findings
Outperforms LLaVA-1.5 13B on multimodal understanding tasks, with a 7.2% average improvement
Achieves a reconstruction FID of 0.63 at 384×384 resolution
Sets a new state of the art in autoregressive image generation, with a GenEval score of 0.55 at 256×256 resolution
Abstract
We present TokenFlow, a novel unified image tokenizer that bridges the long-standing gap between multimodal understanding and generation. Prior research attempts to employ a single reconstruction-targeted Vector Quantization (VQ) encoder for unifying these two tasks. We observe that understanding and generation require fundamentally different granularities of visual information. This leads to a critical trade-off, particularly compromising performance in multimodal understanding tasks. TokenFlow addresses this challenge through an innovative dual-codebook architecture that decouples semantic and pixel-level feature learning while maintaining their alignment via a shared mapping mechanism. This design enables direct access, through shared indices, to both the high-level semantic representations crucial for understanding tasks and the fine-grained visual features essential for generation. Our…
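The shared-index idea in the abstract can be illustrated with a minimal sketch. It assumes that a single index is chosen by minimizing a weighted sum of the distances to a semantic codebook and a pixel codebook, and that the same index then retrieves an entry from each. The combination rule, the `pix_weight` parameter, the function names, and the toy codebooks are all illustrative assumptions; the summary above does not spell out the actual mapping.

```python
from typing import List, Sequence, Tuple

Vector = Sequence[float]


def sq_dist(a: Vector, b: Vector) -> float:
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def shared_index(
    sem_feat: Vector,
    pix_feat: Vector,
    sem_codebook: List[Vector],
    pix_codebook: List[Vector],
    pix_weight: float = 1.0,  # hypothetical balancing weight (assumption)
) -> int:
    """Pick ONE index that jointly fits both codebooks.

    Assumption: the joint cost is the semantic distance plus a weighted
    pixel distance. Both codebooks must have the same number of entries,
    so that entry k in one is paired with entry k in the other.
    """
    if len(sem_codebook) != len(pix_codebook):
        raise ValueError("codebooks must have the same number of entries")
    costs = [
        sq_dist(sem_feat, s) + pix_weight * sq_dist(pix_feat, p)
        for s, p in zip(sem_codebook, pix_codebook)
    ]
    return min(range(len(costs)), key=costs.__getitem__)


def quantize(
    sem_feat: Vector,
    pix_feat: Vector,
    sem_codebook: List[Vector],
    pix_codebook: List[Vector],
    pix_weight: float = 1.0,
) -> Tuple[int, Vector, Vector]:
    """Return the shared index and the two quantized features it selects."""
    k = shared_index(sem_feat, pix_feat, sem_codebook, pix_codebook, pix_weight)
    return k, sem_codebook[k], pix_codebook[k]


# Toy codebooks with three paired entries (made-up values).
SEM_CODEBOOK = [[0.0, 0.0], [1.0, 1.0], [5.0, 5.0]]
PIX_CODEBOOK = [[0.0], [10.0], [1.0]]

idx, sem_q, pix_q = quantize([0.9, 0.9], [9.0], SEM_CODEBOOK, PIX_CODEBOOK)
print(idx, sem_q, pix_q)
```

Because one index addresses both codebooks, a downstream model could consume the semantic entry for understanding and the pixel entry for generation without storing two token streams. Setting `pix_weight` to 0 in this sketch reduces it to ordinary nearest-neighbour lookup in the semantic codebook alone.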
Taxonomy
Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Image Retrieval and Classification Techniques
