TokLIP: Marry Visual Tokens to CLIP for Multimodal Comprehension and Generation

Haokun Lin; Teng Wang; Yixiao Ge; Yuying Ge; Zhichao Lu; Ying Wei; Qingfu Zhang; Zhenan Sun; Ying Shan

arXiv:2505.05422·cs.CV·August 18, 2025

TL;DR

TokLIP is a semantically enhanced visual tokenizer that adds CLIP-level semantics to standard VQ tokens, improving multimodal comprehension while keeping end-to-end autoregressive training for generation.

Contribution

The paper proposes a visual tokenizer that combines VQ tokens with CLIP-level semantics and disentangles the training objectives for comprehension and generation, enabling efficient multimodal comprehension and generation.

Findings

01 Achieves high data efficiency in multimodal tasks

02 Enhances semantic understanding of visual tokens

03 Improves generative capacity for autoregressive models

Abstract

Pioneering token-based works such as Chameleon and Emu3 have established a foundation for multimodal unification but face challenges of high training computational overhead and limited comprehension performance due to a lack of high-level semantics. In this paper, we introduce TokLIP, a visual tokenizer that enhances comprehension by semanticizing vector-quantized (VQ) tokens and incorporating CLIP-level semantics while enabling end-to-end multimodal autoregressive training with standard VQ tokens. TokLIP integrates a low-level discrete VQ tokenizer with a ViT-based token encoder to capture high-level continuous semantics. Unlike previous approaches (e.g., VILA-U) that discretize high-level features, TokLIP disentangles training objectives for comprehension and generation, allowing the direct application of advanced VQ tokenizers without the need for tailored quantization operations.…
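
The two-stage design described in the abstract (discrete VQ tokens first, then a token encoder that produces continuous semantics) can be illustrated with a small sketch. This is not the authors' implementation: the function names, the dimensions, and the random weights are all hypothetical, and a mean-pool plus linear projection stands in for the ViT-based token encoder purely to keep the example short.

```python
import math
import random


def quantize(patches, codebook):
    """Map each patch vector to the index of its nearest codebook entry (squared L2)."""
    ids = []
    for p in patches:
        dists = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in codebook]
        ids.append(dists.index(min(dists)))
    return ids


def token_encoder(ids, codebook, proj):
    """Toy stand-in for a ViT-based token encoder (assumption, not the paper's model):
    embed the discrete ids, mean-pool them, then apply a linear projection."""
    embedded = [codebook[i] for i in ids]
    dim = len(codebook[0])
    pooled = [sum(e[d] for e in embedded) / len(embedded) for d in range(dim)]
    return [sum(w * x for w, x in zip(row, pooled)) for row in proj]


def cosine(u, v):
    """Cosine similarity, the usual score for CLIP-style image-text alignment."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Hypothetical sizes: 8 codebook entries of dimension 4, 16 image patches,
# and a 3-dimensional semantic space.
rng = random.Random(0)
codebook = [[rng.gauss(0, 1) for _ in range(4)] for _ in range(8)]
patches = [[rng.gauss(0, 1) for _ in range(4)] for _ in range(16)]
proj = [[rng.gauss(0, 1) for _ in range(4)] for _ in range(3)]

ids = quantize(patches, codebook)               # discrete tokens, usable for generation
semantic = token_encoder(ids, codebook, proj)   # continuous feature, usable for comprehension
```

In this sketch the discrete `ids` are left untouched by the semantic branch, which mirrors the abstract's point that the high-level features are not themselves discretized; only the continuous vector would be compared against text features with `cosine`.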

Code & Models

Repositories

tencentarc/toklip · jax · Official

Models

🤗 TencentARC/TokLIP
model · 12 dl · ♡ 13

Taxonomy

Topics: Multimodal Machine Learning Applications · Generative Adversarial Networks and Image Synthesis · Domain Adaptation and Few-Shot Learning