MUSE-VL: Modeling Unified VLM through Semantic Discrete Encoding

Rongchang Xie; Chen Du; Ping Song; Chang Liu

arXiv:2411.17762 · cs.CV · July 29, 2025

TL;DR

MUSE-VL introduces Semantic Discrete Encoding to align visual and language tokens, reducing training data needs and enhancing performance in multimodal understanding and generation.

Contribution

The paper proposes Semantic Discrete Encoding, a novel approach that improves alignment between visual and language tokens in unified vision-language models.

Findings

1. Improves understanding performance by 4.8% over the previous SOTA, Emu3.

2. Surpasses the dedicated understanding model LLaVA-NeXT 34B by 3.7%.

3. Outperforms existing unified models on visual generation benchmarks.

Abstract

We introduce MUSE-VL, a Unified Vision-Language Model through Semantic Discrete Encoding for multimodal understanding and generation. Recently, the research community has begun exploring unified models for visual generation and understanding. However, existing vision tokenizers (e.g., VQGAN) only consider low-level information, which makes their visual tokens difficult to align with language tokens. This results in high training complexity and necessitates a large amount of training data to achieve optimal performance. Additionally, the performance of these unified models is still far from that of dedicated understanding models. This paper proposes Semantic Discrete Encoding (SDE), which effectively aligns the information of visual tokens and language tokens by adding semantic constraints to the visual tokenizer. This greatly reduces the amount of training data required and improves the performance of the unified model. With the same LLM…
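
To make "adding semantic constraints to the visual tokenizer" concrete, the following is a minimal NumPy sketch of one way such a constraint could be attached to a vector-quantized tokenizer. It is an illustration under stated assumptions, not the paper's implementation: the cosine-similarity form of the semantic term, the weights `w_sem` and `beta`, and all function and argument names are hypothetical, and `sem_target` stands in for features from some frozen semantic encoder that the abstract does not specify.

```python
import numpy as np

def quantize(z, codebook):
    """Nearest-neighbour vector quantization.
    z: (n, d) continuous visual features; codebook: (k, d) code vectors.
    Returns (ids, z_q): discrete token ids (n,) and their code vectors (n, d)."""
    # Squared Euclidean distance from every feature to every code vector.
    d2 = ((z[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    ids = d2.argmin(axis=1)
    return ids, codebook[ids]

def cosine_loss(a, b, eps=1e-8):
    """Mean of (1 - cosine similarity) over matching rows of a and b."""
    a_n = a / (np.linalg.norm(a, axis=1, keepdims=True) + eps)
    b_n = b / (np.linalg.norm(b, axis=1, keepdims=True) + eps)
    return float(np.mean(1.0 - (a_n * b_n).sum(axis=1)))

def tokenizer_loss(z, codebook, recon, image, sem_pred, sem_target,
                   w_sem=1.0, beta=0.25):
    """Illustrative tokenizer objective (all weights are assumptions).
    rec: pixel reconstruction error (the low-level signal a VQGAN-style tokenizer uses).
    vq:  distance between features and their selected codes.
    sem: hypothetical semantic constraint pulling features decoded from the
         quantized tokens (sem_pred) toward frozen semantic features (sem_target)."""
    _, z_q = quantize(z, codebook)
    rec = float(np.mean((recon - image) ** 2))
    # Without autograd, the codebook and commitment terms have the same value;
    # in a training framework they differ only in stop-gradient placement.
    vq = float(np.mean((z_q - z) ** 2))
    sem = cosine_loss(sem_pred, sem_target)
    total = rec + (1.0 + beta) * vq + w_sem * sem
    return {"rec": rec, "vq": vq, "sem": sem, "total": total}
```

In this sketch, setting `w_sem=0` reduces the objective to reconstruction plus quantization terms only, which corresponds to the low-level-only tokenizers the abstract contrasts SDE with.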

Taxonomy

Topics: Neural Networks and Reservoir Computing · Advanced Computational Techniques and Applications · Neural Networks and Applications

Methods: ALIGN