WinTok: A Win-Win Hybrid Tokenizer via Decomposing Visual Understanding and Generation with Transferable Tokens

Yiwei Guo; Shaobin Zhuang; Zhipeng Huang; Canmiao Fu; Chen Li; Jing Lyu; Yali Wang

arXiv:2605.18115 · cs.CV · May 19, 2026

TL;DR

WinTok introduces a hybrid visual tokenizer that decouples semantic understanding and pixel reconstruction, improving performance across multiple benchmarks with less training data.

Contribution

WinTok is a hybrid tokenizer that explicitly decouples semantic tokens from pixel tokens and applies asymmetric token distillation to the semantic tokens, strengthening both visual understanding and generation.

Findings

01 Outperforms baseline UniTok by 11.2% in classification accuracy.

02 Achieves a reconstruction rFID of 0.41 on 10 benchmarks.

03 Requires only 50M open-source data for training.

Abstract

Building a unified visual tokenizer is essential for bridging the gap between visual understanding and generation. Yet existing approaches struggle with the inherent conflict between these tasks, as a single token space is forced to support both high-level semantic abstraction and low-level pixel reconstruction. We propose WinTok, a concise hybrid tokenizer that achieves a win-win performance by explicitly decoupling the two objectives. WinTok supplements pixel tokens with a set of learnable semantic tokens, effectively mitigating cross-task interference without incurring the computational overhead of dual tokenizers. To further enhance understanding capability, we introduce an asymmetric token distillation mechanism: the semantic tokens are guided by pretrained semantic embeddings from any visual foundation model, enabling them to inherit strong discriminative power while maintaining…
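
The decoupling described in the abstract can be illustrated with a small sketch. It is not the authors' implementation: the function names, the cosine-based distillation loss, the loss weighting, and the reading of "asymmetric" as "distillation applies to the semantic tokens only" are all assumptions made for illustration, since the abstract is truncated before it describes the mechanism in full.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b + 1e-8)

def build_hybrid_sequence(pixel_tokens, semantic_tokens):
    """Append learnable semantic tokens to the pixel tokens so that a
    single tokenizer processes one sequence (hypothetical layout)."""
    return pixel_tokens + semantic_tokens

def distillation_loss(semantic_tokens, teacher_embeddings):
    """Mean (1 - cosine similarity) between each semantic token and its
    teacher embedding. The cosine form is an assumption, not the paper's."""
    losses = [
        1.0 - cosine_similarity(s, t)
        for s, t in zip(semantic_tokens, teacher_embeddings)
    ]
    return sum(losses) / len(losses)

def total_loss(recon_loss, semantic_tokens, teacher_embeddings, distill_weight=1.0):
    """Assumed asymmetry: only the semantic tokens receive the distillation
    signal; the reconstruction loss is computed elsewhere and passed in."""
    return recon_loss + distill_weight * distillation_loss(
        semantic_tokens, teacher_embeddings
    )

# Toy usage: 4 pixel tokens and 2 semantic tokens of dimension 2.
pixel = [[0.1, 0.2]] * 4
semantic = [[1.0, 0.0], [0.0, 2.0]]
teacher = [[1.0, 0.0], [0.0, 1.0]]  # stand-in for foundation-model embeddings
sequence = build_hybrid_sequence(pixel, semantic)
loss = total_loss(0.5, semantic, teacher)
```

In this toy setup the semantic tokens already point in the same direction as the teacher embeddings, so the distillation term is close to zero and the total stays near the reconstruction loss passed in.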

Code & Models

Repositories

markywg/WinTok (GitHub)

Models

markyw/WinTok (Hugging Face model)