TokLIP: Marry Visual Tokens to CLIP for Multimodal Comprehension and Generation
Haokun Lin, Teng Wang, Yixiao Ge, Yuying Ge, Zhichao Lu, Ying Wei, Qingfu Zhang, Zhenan Sun, Ying Shan

TL;DR
TokLIP is a semantically enhanced visual tokenizer that adds CLIP-level semantics to standard VQ tokens, improving multimodal comprehension while still supporting end-to-end autoregressive training on those tokens.
Contribution
The paper proposes a visual tokenizer that combines low-level discrete VQ tokens with CLIP-level semantics and disentangles the training objectives for comprehension and generation.
Findings
Achieves high data efficiency in multimodal tasks
Enhances semantic understanding of visual tokens
Improves generative capacity for autoregressive models
Abstract
Pioneering token-based works such as Chameleon and Emu3 have established a foundation for multimodal unification but face challenges of high training computational overhead and limited comprehension performance due to a lack of high-level semantics. In this paper, we introduce TokLIP, a visual tokenizer that enhances comprehension by semanticizing vector-quantized (VQ) tokens and incorporating CLIP-level semantics while enabling end-to-end multimodal autoregressive training with standard VQ tokens. TokLIP integrates a low-level discrete VQ tokenizer with a ViT-based token encoder to capture high-level continuous semantics. Unlike previous approaches (e.g., VILA-U) that discretize high-level features, TokLIP disentangles training objectives for comprehension and generation, allowing the direct application of advanced VQ tokenizers without the need for tailored quantization operations.…
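The two-path design described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the codebook, the embedding table, the sizes, and the function names (`quantize`, `self_attention`, `encode_tokens`) are all hypothetical, and a single unparameterized attention pass stands in for the ViT-based token encoder. It only shows the data flow: patches are mapped to discrete VQ token ids, and those same ids are then encoded into continuous features.

```python
import math
import random


def quantize(patches, codebook):
    """Map each patch vector to the index of its nearest codebook entry (standard VQ lookup)."""
    ids = []
    for p in patches:
        dists = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in codebook]
        ids.append(dists.index(min(dists)))
    return ids


def self_attention(xs):
    """One unparameterized softmax self-attention pass (toy stand-in for a ViT layer)."""
    dim = len(xs[0])
    out = []
    for q in xs:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in xs]
        top = max(scores)
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        out.append([sum(w * v[d] for w, v in zip(weights, xs)) / total
                    for d in range(dim)])
    return out


def encode_tokens(ids, embedding):
    """Embed discrete token ids and mix them into continuous per-token features."""
    return self_attention([embedding[i] for i in ids])


# Hypothetical toy sizes: 8 codes, 4-dim patches, 6-dim token embeddings, 5 patches.
rng = random.Random(0)
codebook = [[rng.gauss(0, 1) for _ in range(4)] for _ in range(8)]
embedding = [[rng.gauss(0, 1) for _ in range(6)] for _ in range(8)]
patches = [[rng.gauss(0, 1) for _ in range(4)] for _ in range(5)]

token_ids = quantize(patches, codebook)                # discrete path (generation)
semantic_feats = encode_tokens(token_ids, embedding)   # continuous path (comprehension)
```

In this sketch the discrete ids are left untouched by the encoder, which mirrors the abstract's point that the high-level features are not themselves quantized.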
Taxonomy
Topics: Multimodal Machine Learning Applications · Generative Adversarial Networks and Image Synthesis · Domain Adaptation and Few-Shot Learning
