EvoTok: A Unified Image Tokenizer via Residual Latent Evolution for Visual Understanding and Generation
Yan Li, Ning Liao, Xiangyu Zhao, Shaofeng Zhang, Xiaoxing Wang, Yifan Yang, Junchi Yan, Xue Yang

TL;DR
EvoTok introduces a residual latent evolution approach to unify image understanding and generation within a shared token space, achieving high-quality reconstruction and strong benchmark performance with a modest dataset.
Contribution
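EvoTok uses a residual evolution process within a shared latent space to unify visual understanding and generation. This avoids the interference that arises when both objectives supervise the same representations, and the inconsistency that arises when they are split across separate feature spaces.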
Findings
Achieves 0.43 rFID on ImageNet-1K at 256x256 resolution.
Performs well on 7 out of 9 visual understanding benchmarks.
Shows remarkable results on image generation benchmarks.
Abstract
The development of unified multimodal large language models (MLLMs) is fundamentally challenged by the granularity gap between visual understanding and generation: understanding requires high-level semantic abstractions, while image generation demands fine-grained pixel-level representations. Existing approaches usually either enforce both forms of supervision on the same set of representations or decouple them into separate feature spaces, leading to interference and inconsistency, respectively. In this work, we propose EvoTok, a unified image tokenizer that reconciles these requirements through a residual evolution process within a shared latent space. Instead of maintaining separate token spaces for pixels and semantics, EvoTok encodes an image into a cascaded sequence of residual tokens via residual vector quantization. This residual sequence forms an evolution trajectory where…
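The abstract names residual vector quantization (RVQ) as the mechanism that produces the cascaded token sequence. The block below is a minimal sketch of generic RVQ on a single feature vector. It is not EvoTok's implementation: the codebooks are random rather than learned, the all-zero entry in each codebook is an illustrative choice that keeps the error from growing between stages, and every name (nearest, rvq_encode, rvq_decode, make_codebooks) is hypothetical.

```python
import random


def nearest(codebook, vec):
    """Return the index of the codebook entry closest to vec (squared L2)."""
    def dist(code):
        return sum((c - v) ** 2 for c, v in zip(code, vec))
    return min(range(len(codebook)), key=lambda i: dist(codebook[i]))


def rvq_encode(vec, codebooks):
    """Quantize vec stage by stage; each stage encodes what earlier stages missed."""
    residual = list(vec)
    indices = []
    for codebook in codebooks:
        idx = nearest(codebook, residual)
        indices.append(idx)
        # Subtract the chosen code so the next stage sees only the leftover.
        residual = [r - c for r, c in zip(residual, codebook[idx])]
    return indices


def rvq_decode(indices, codebooks, depth=None):
    """Sum the selected codes from the first `depth` stages (all stages by default)."""
    depth = len(indices) if depth is None else depth
    out = [0.0] * len(codebooks[0][0])
    for idx, codebook in list(zip(indices, codebooks))[:depth]:
        out = [o + c for o, c in zip(out, codebook[idx])]
    return out


def make_codebooks(num_stages=4, codebook_size=16, dim=8, seed=0):
    """Random stand-in codebooks (assumption: a real tokenizer would learn these)."""
    rng = random.Random(seed)
    codebooks = []
    for stage in range(num_stages):
        scale = 1.0 / (2 ** stage)  # later stages cover finer residuals
        codes = [[0.0] * dim]       # zero code: a stage can never increase the error
        codes += [[rng.gauss(0.0, scale) for _ in range(dim)]
                  for _ in range(codebook_size - 1)]
        codebooks.append(codes)
    return codebooks


if __name__ == "__main__":
    codebooks = make_codebooks()
    rng = random.Random(1)
    vec = [rng.gauss(0.0, 1.0) for _ in range(8)]
    indices = rvq_encode(vec, codebooks)
    for depth in range(len(indices) + 1):
        approx = rvq_decode(indices, codebooks, depth)
        err = sum((v - a) ** 2 for v, a in zip(vec, approx))
        print(depth, round(err, 4))
```

Decoding with a smaller depth gives a coarser approximation and a larger depth refines it, which is the general sense in which a residual token sequence runs from coarse to fine. How EvoTok assigns pixel-level and semantic roles along that sequence is not specified in the truncated abstract, so the sketch does not model it.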
Taxonomy
Topics: Multimodal Machine Learning Applications · Generative Adversarial Networks and Image Synthesis · Domain Adaptation and Few-Shot Learning
