Wave-Particle (Continuous-Discrete) Dualistic Visual Tokenization for Unified Understanding and Generation

Yizhu Chen; Chen Ju; Zhicheng Wang; Shuai Xiao; Xu Chen; Jinsong Lan; Xiaoyong Zhu; Ying Chen

arXiv:2511.01593 · cs.CV · November 4, 2025

TL;DR

This paper introduces CDD-VT, a novel visual tokenizer inspired by wave-particle duality, which adaptively combines continuous and discrete representations to improve multi-modal understanding and generation.

Contribution

The paper proposes a dualistic visual tokenizer that adaptively balances continuous and discrete features based on image complexity, enhancing performance and scalability.
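
To make the idea of a complexity-dependent token budget concrete, here is a minimal sketch. It is not the paper's method: it swaps in plain greedy residual quantization as a stand-in, and the names `nearest`, `adaptive_tokenize`, `tol`, and `max_primitives`, as well as the toy two-dimensional codebook, are all hypothetical. The point is only that a feature vector which is easy to approximate stops after very few codewords (discrete-like behavior), while a harder one keeps adding codewords and approaches a continuous representation.

```python
import math


def nearest(codebook, v):
    """Return the index of the codeword closest to v (squared Euclidean distance)."""
    return min(
        range(len(codebook)),
        key=lambda i: sum((c - a) ** 2 for c, a in zip(codebook[i], v)),
    )


def adaptive_tokenize(x, codebook, tol=0.1, max_primitives=8):
    """Greedy residual quantization with a data-dependent number of primitives.

    Hypothetical illustration only: codewords are subtracted from the residual
    until its norm drops to `tol` or below, or `max_primitives` is reached.
    """
    residual = list(x)
    indices = []
    while (
        len(indices) < max_primitives
        and math.sqrt(sum(r * r for r in residual)) > tol
    ):
        i = nearest(codebook, residual)
        indices.append(i)
        residual = [r - c for r, c in zip(residual, codebook[i])]
    return indices, residual


# Toy codebook (an assumption): coarse and fine steps along each axis.
codebook = [
    [1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0],
    [0.25, 0.0], [0.0, 0.25], [-0.25, 0.0], [0.0, -0.25],
]

simple_ids, simple_res = adaptive_tokenize([1.0, 0.0], codebook)
complex_ids, complex_res = adaptive_tokenize([1.25, -0.75], codebook)
print(len(simple_ids), len(complex_ids))
```

In this toy setup the first vector coincides with a single codeword, so it consumes fewer primitives than the second, which needs both coarse and fine steps. How CDD-VT actually measures complexity and assigns primitives is not specified in the text available here.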

Findings

01. Outperforms specialized continuous and discrete tokenizers in multiple tasks.
02. Achieves superior reconstruction, retrieval, and classification results.
03. Provides a scalable, unified approach for multi-modal large models.

Abstract

The unification of understanding and generation within a single multi-modal large model (MLLM) remains a significant challenge, largely due to the dichotomy between continuous and discrete visual tokenizations. Continuous tokenizers (CT) achieve strong performance by bridging multiple independently trained understanding and generation modules, but suffer from complex multi-stage pipelines and substantial engineering overhead. Conversely, discrete tokenizers (DT) offer a conceptually elegant approach by quantizing each image into a primitive, but inevitably incur information loss and performance degradation. To resolve this tension, we question the binary choice between CT and DT and, inspired by the wave-particle duality of light, propose the Continuous-Discrete Dualistic Visual Tokenizer (CDD-VT). We treat visual data as a flexible composition of image primitives derived…

Taxonomy

Topics: Generative Adversarial Networks and Image Synthesis · Advanced Image and Video Retrieval Techniques · Image Enhancement Techniques