UniCode$^2$: Cascaded Large-scale Codebooks for Unified Multimodal Understanding and Generation

Yanzhe Chen (Yen-chieh Chan); Huasong Zhong; Yan Li; Zhenheng Yang

arXiv:2506.20214·cs.CV·July 9, 2025

TL;DR

UniCode$^2$ introduces a cascaded large-scale visual codebook system that enhances multimodal understanding and generation by improving token semantics, stability, and alignment with text, enabling better visual synthesis and comprehension.

Contribution

The paper presents a novel cascaded codebook framework with 500K entries that improves semantic alignment, stability, and scalability in visual tokenization for multimodal models.

Findings

1. Achieves high performance across multiple benchmarks.
2. Enables high-quality visual synthesis with minimal adaptation.
3. Maintains stability and semantic alignment at large scale.

Abstract

Unified multimodal large language models (MLLMs) have shown promise in jointly advancing multimodal understanding and generation, with visual codebooks discretizing images into tokens for autoregressive modeling. Existing codebook-based methods either rely on small vocabularies (~16K entries) that lack fine-grained semantics or naively scale up, resulting in low token utilization and unstable training. We propose UniCode$^2$, a cascaded codebook framework enabling large-scale, semantically aligned, and stable visual tokenization. By clustering millions of SigLIP sequence embeddings, we build a 500K-entry codebook that preserves vision-language alignment while expanding capacity. Stability is ensured via a cascaded design: a frozen codebook anchors the embedding space, and a trainable codebook refines task-specific semantics. This decoupling promotes high utilization and robust learning.…
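
The cascaded idea in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: random vectors stand in for SigLIP embeddings, plain Lloyd's k-means stands in for whatever clustering the paper uses, the codebook has 64 entries rather than 500K, and the assumption that the trainable codebook is an embedding table indexed by the frozen codebook's token ids is ours. The helper names `nearest` and `kmeans` are hypothetical.

```python
import numpy as np

def nearest(x, codebook):
    """Index of the closest codebook entry (squared Euclidean) for each row of x."""
    d2 = (x ** 2).sum(axis=1, keepdims=True) - 2.0 * x @ codebook.T + (codebook ** 2).sum(axis=1)
    return d2.argmin(axis=1)

def kmeans(x, k, iters=10, seed=0):
    """Plain Lloyd's k-means; returns a (k, d) array of centroids."""
    rng = np.random.default_rng(seed)
    centroids = x[rng.choice(len(x), size=k, replace=False)].copy()
    for _ in range(iters):
        assign = nearest(x, centroids)
        for j in range(k):
            members = x[assign == j]
            if len(members) > 0:  # keep the old centroid if a cluster is empty
                centroids[j] = members.mean(axis=0)
    return centroids

rng = np.random.default_rng(0)
# Assumption: random vectors stand in for SigLIP sequence embeddings.
embeddings = rng.normal(size=(2000, 16)).astype(np.float32)

# Stage 1: the frozen codebook is built once by clustering and never updated.
frozen = kmeans(embeddings, k=64)
frozen_snapshot = frozen.copy()

# Stage 2 (assumption): the trainable codebook starts as a copy of the frozen one
# and is looked up with the token ids produced by the frozen codebook.
trainable = frozen.copy()

tokens = nearest(embeddings, frozen)   # discrete visual tokens
features = trainable[tokens]           # task-specific embeddings fed downstream
utilization = len(np.unique(tokens)) / len(frozen)

# Stand-in for a training step: only the trainable table changes.
trainable += 0.01 * rng.normal(size=trainable.shape).astype(np.float32)
tokens_after = nearest(embeddings, frozen)

print(f"codebook utilization: {utilization:.2f}")
print("token ids unchanged after update:", bool(np.array_equal(tokens, tokens_after)))
```

Because token assignment depends only on the frozen codebook, updating the trainable table cannot shift which token an image patch maps to, which is one plausible reading of how the decoupling supports stable training.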

Taxonomy

Topics: Multimodal Machine Learning Applications · Generative Adversarial Networks and Image Synthesis · Domain Adaptation and Few-Shot Learning

Methods: Diffusion