UniWeTok: An Unified Binary Tokenizer with Codebook Size $\mathit{2^{128}}$ for Unified Multimodal Large Language Model

Shaobin Zhuang; Yuang Ai; Jiaming Han; Weijia Mao; Xiaohui Li; Fangyikang Wang; Xiao Wang; Yan Li; Shanchuan Lin; Kun Xu; Zhenheng Yang; Huaibo Huang; Xiangyu Yue; Hao Chen; Yali Wang

arXiv:2602.14178·cs.CV·March 12, 2026

TL;DR

UniWeTok is a unified discrete visual tokenizer built on a massive binary codebook ($2^{128}$). It aims to balance high-fidelity reconstruction, semantic extraction, and generative suitability for multimodal large language models, and it reports state-of-the-art image generation on ImageNet at low training cost.

Contribution

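The paper presents UniWeTok, a unified discrete visual tokenizer with a $2^{128}$ binary codebook. It is trained with Pre-Post Distillation and a Generative-Aware Prior, and it uses a convolution-attention hybrid architecture with the SigLu activation function, with the goal of improving both multimodal understanding and generation.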

Findings

01

Achieves state-of-the-art image generation performance on ImageNet

02

Requires significantly less training compute than previous methods

03

Demonstrates strong capabilities across diverse multimodal tasks

Abstract

Unified Multimodal Large Language Models (MLLMs) require a visual representation that simultaneously supports high-fidelity reconstruction, complex semantic extraction, and generative suitability. However, existing visual tokenizers typically struggle to satisfy these conflicting objectives within a single framework. In this paper, we introduce UniWeTok, a unified discrete tokenizer designed to bridge this gap using a massive binary codebook ($2^{128}$). For the training framework, we introduce Pre-Post Distillation and a Generative-Aware Prior to enhance the semantic extraction and generative prior of the discrete tokens. For the model architecture, we propose a convolution-attention hybrid architecture with the SigLu activation function. SigLu activation not only bounds the encoder output and stabilizes the semantic distillation process but also effectively addresses the…
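
To make the codebook size concrete, here is a minimal sketch of how a binary codebook of size $2^{128}$ can arise without storing any codebook entries. It assumes a sign-based binarization of a 128-channel latent vector, which the abstract does not spell out, so this is an illustration of the general idea rather than UniWeTok's actual quantizer. The names `binarize`, `bits_to_index`, and `index_to_code` are hypothetical helpers introduced only for this example.

```python
import random

NUM_BITS = 128  # assumption: one bit per latent channel, giving 2**128 possible codes


def binarize(latent):
    """Map each real-valued channel to a bit: 1 if the value is positive, else 0."""
    return [1 if x > 0 else 0 for x in latent]


def bits_to_index(bits):
    """Pack the bits into one integer code index (first bit is most significant)."""
    index = 0
    for b in bits:
        index = (index << 1) | b
    return index


def index_to_code(index, num_bits=NUM_BITS):
    """Recover the +1/-1 code vector that an integer index stands for."""
    return [1.0 if (index >> (num_bits - 1 - i)) & 1 else -1.0 for i in range(num_bits)]


# The codebook is implicit: every 128-bit pattern is a valid code.
codebook_size = 2 ** NUM_BITS

# Toy latent vector standing in for one encoder output position (random, not real data).
rng = random.Random(0)
latent = [rng.uniform(-1.0, 1.0) for _ in range(NUM_BITS)]

bits = binarize(latent)
index = bits_to_index(bits)
code = index_to_code(index)
```

Because each code is fully determined by its bit pattern, the tokenizer in this sketch never needs a lookup table: quantization is a per-channel sign test, and decoding an index back to a vector is pure bit arithmetic.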

Taxonomy

Topics: Generative Adversarial Networks and Image Synthesis · Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning