Selftok: Discrete Visual Tokens of Autoregression, by Diffusion, and for Reasoning

Bohan Wang; Zhongqi Yue; Fengda Zhang; Shuo Chen; Li'an Bi; Junzhe Zhang; Xue Song; Kennard Yanting Chan; Jiachun Pan; Weijia Wu; Mingze Zhou; Wang Lin; Kaihang Pan; Saining Zhang; Liyu Jia; Wentao Hu; Wei Zhao; Hanwang Zhang

arXiv:2505.07538 · cs.CV · May 28, 2025

TL;DR

Selftok is a discrete visual tokenizer that discards the conventional spatial prior and instead builds an autoregressive prior into its tokens through the reverse diffusion process. The resulting tokens support high-quality image reconstruction and effective reinforcement learning.

Contribution

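The paper proposes Selftok, a discrete visual tokenizer whose tokens acquire an autoregressive structure from the reverse diffusion process. This allows a vision-language model to be trained with a purely discrete autoregressive architecture and makes the visual tokens amenable to reinforcement learning.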

Findings

1. Selftok achieves state-of-the-art reconstruction quality and compression.
2. Reinforcement learning with Selftok significantly improves visual generation performance.
3. Selftok enables training vision-language models without text-image pairs.

Abstract

We completely discard the conventional spatial prior in image representation and introduce a novel discrete visual tokenizer: Self-consistency Tokenizer (Selftok). At its design core, we compose an autoregressive (AR) prior -- mirroring the causal structure of language -- into visual tokens by using the reverse diffusion process of image generation. The AR property makes Selftok fundamentally distinct from traditional spatial tokens in the following two key ways:

- Selftok offers an elegant and minimalist approach to unify diffusion and AR for vision-language models (VLMs): By representing images with Selftok tokens, we can train a VLM using a purely discrete autoregressive architecture -- like that in LLMs -- without requiring additional modules or training objectives.
- We theoretically show that the AR prior satisfies the Bellman equation, whereas the spatial prior does not.…
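
The "purely discrete autoregressive architecture" described in the abstract corresponds to the standard next-token factorization used in LLMs. A minimal sketch follows, under assumed notation that does not come from the paper: an image is encoded as K discrete tokens v_1, ..., v_K, c is an optional conditioning context (for example, text tokens), and θ denotes the model parameters. This is the generic AR likelihood and its negative log-likelihood loss, not a statement of the authors' exact training objective.

```latex
% Assumed notation (hypothetical, for illustration only):
%   v_1, ..., v_K : discrete visual tokens of one image
%   c             : optional conditioning context, e.g. text tokens
%   \theta        : parameters of the autoregressive model
p_\theta(v_1, \dots, v_K \mid c) = \prod_{k=1}^{K} p_\theta(v_k \mid v_{<k}, c),
\qquad
\mathcal{L}(\theta) = -\sum_{k=1}^{K} \log p_\theta(v_k \mid v_{<k}, c).
```

Under this factorization, each visual token is predicted from the tokens before it, exactly as a text token would be, which is why no additional module or objective is needed.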

Code & Models

Models

🤗 selftok-team/SelftokTokenizer · model · ♡ 2

Taxonomy

Topics: Generative Adversarial Networks and Image Synthesis · Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning

Methods: Diffusion