TL;DR
WinTok introduces a hybrid visual tokenizer that decouples semantic understanding and pixel reconstruction, improving performance across multiple benchmarks with less training data.
Contribution
The paper proposes a hybrid tokenizer that explicitly decouples semantic tokens from pixel tokens and applies asymmetric token distillation, improving both visual understanding and generation.
Findings
Outperforms baseline UniTok by 11.2% in classification accuracy.
Achieves a reconstruction rFID of 0.41 on 10 benchmarks.
Requires only 50M open-source training samples.
Abstract
Building a unified visual tokenizer is essential for bridging the gap between visual understanding and generation. Yet existing approaches struggle with the inherent conflict between these tasks, as a single token space is forced to support both high-level semantic abstraction and low-level pixel reconstruction. We propose WinTok, a concise hybrid tokenizer that achieves a win-win performance by explicitly decoupling the two objectives. WinTok supplements pixel tokens with a set of learnable semantic tokens, effectively mitigating cross-task interference without incurring the computational overhead of dual tokenizers. To further enhance understanding capability, we introduce an asymmetric token distillation mechanism: the semantic tokens are guided by pretrained semantic embeddings from any visual foundation model, enabling them to inherit strong discriminative power while maintaining…
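The decoupling described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the choice to append semantic tokens after pixel tokens, and the cosine-distance loss are all assumptions made for illustration. "Asymmetric" is read here as meaning that only the semantic tokens are pulled toward the teacher embeddings while the pixel tokens receive no distillation signal; the truncated abstract does not confirm the exact formulation.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b + 1e-8)


def build_hybrid_sequence(pixel_tokens, semantic_tokens):
    """Concatenate pixel tokens with semantic tokens.

    Assumption: semantic tokens are appended after the pixel tokens;
    the source does not state the ordering.
    """
    return list(pixel_tokens) + list(semantic_tokens)


def semantic_distillation_loss(hybrid_tokens, teacher_embeddings):
    """Mean cosine distance between semantic tokens and teacher embeddings.

    Assumption: one teacher embedding per semantic token, and a cosine
    distance objective. Pixel tokens are deliberately excluded, which is
    how this sketch interprets "asymmetric".
    """
    num_semantic = len(teacher_embeddings)
    if num_semantic == 0 or num_semantic > len(hybrid_tokens):
        raise ValueError("teacher_embeddings must match the semantic token count")
    # Semantic tokens occupy the tail of the hybrid sequence.
    start = len(hybrid_tokens) - num_semantic
    semantic = hybrid_tokens[start:]
    distances = [
        1.0 - cosine_similarity(s, t)
        for s, t in zip(semantic, teacher_embeddings)
    ]
    return sum(distances) / num_semantic


# Toy example: three pixel tokens and two semantic tokens of dimension 2.
pixel = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
semantic = [[0.5, 0.5], [2.0, 0.0]]
teacher = [[1.0, 1.0], [1.0, 0.0]]  # hypothetical foundation-model embeddings

sequence = build_hybrid_sequence(pixel, semantic)
loss = semantic_distillation_loss(sequence, teacher)
```

Because the loss touches only the tail of the sequence, changing the pixel tokens leaves it unchanged. That is the property the sketch is meant to show: the reconstruction pathway is not constrained by the semantic teacher.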
