Wave-Particle (Continuous-Discrete) Dualistic Visual Tokenization for Unified Understanding and Generation
Yizhu Chen, Chen Ju, Zhicheng Wang, Shuai Xiao, Xu Chen, Jinsong Lan, Xiaoyong Zhu, Ying Chen

TL;DR
This paper introduces CDD-VT, a novel visual tokenizer inspired by wave-particle duality, which adaptively combines continuous and discrete representations to improve multi-modal understanding and generation.
Contribution
The paper proposes a dualistic visual tokenizer that adaptively balances continuous and discrete features based on image complexity, enhancing performance and scalability.
Findings
Outperforms specialized continuous and discrete tokenizers in multiple tasks.
Achieves superior reconstruction, retrieval, and classification results.
Provides a scalable, unified approach for multi-modal large models.
Abstract
The unification of understanding and generation within a single multi-modal large model (MLLM) remains one significant challenge, largely due to the dichotomy between continuous and discrete visual tokenizations. Continuous tokenizer (CT) achieves strong performance by bridging multiple independently-trained understanding modules and generation modules, but suffers from complex multi-stage pipelines and substantial engineering overhead. Conversely, discrete tokenizers (DT) offer a conceptually elegant idea by quantizing each image into a primitive, but inevitably leading to information loss and performance degradation. To resolve this tension, we question the binary choice between CT and DT, inspired by the wave-particle duality of light, and propose the Continuous-Discrete Dualistic Visual Tokenizer (CDD-VT). We treat visual data as a flexible composition of image primitives derived…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsGenerative Adversarial Networks and Image Synthesis · Advanced Image and Video Retrieval Techniques · Image Enhancement Techniques
