Cheers: Decoupling Patch Details from Semantic Representations Enables Unified Multimodal Comprehension and Generation

Yichen Zhang; Da Peng; Zonghao Guo; Zijian Zhang; Xuesong Yang; Tong Sun; Shichu Sun; Yidan Zhang; Yanghao Li; Haiyan Zhao; Wang Xu; Qi Shi; Yangang Sun; Chi Chen; Shuo Wang; Yukun Yan; Xu Han; Qiang Ma; Wei Ke; Liang Wang; Zhiyuan Liu; Maosong Sun

arXiv:2603.12793 · cs.CV · March 16, 2026

TL;DR

Cheers introduces a unified multimodal model that decouples patch details from semantics, enabling efficient and high-quality visual understanding and generation with improved fidelity and reduced training costs.

Contribution

The paper presents Cheers, a novel model that unifies multimodal tasks by decoupling visual patch details from semantic representations, enhancing efficiency and performance.

Findings

01 Matches or surpasses state-of-the-art in visual understanding and generation.

02 Achieves 4x token compression for high-resolution image processing.

03 Outperforms Tar-1.5B on the GenEval and MMBench benchmarks at lower training cost.
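
To make the 4x token compression figure concrete, here is a minimal arithmetic sketch. The patch size of 16 pixels and the 2x2 merging of neighbouring patches are assumptions chosen for illustration; the listing does not say how Cheers reaches the 4x ratio, and `token_count` is a hypothetical helper.

```python
def token_count(height, width, patch=16, merge=2):
    """Count visual tokens before and after merging patch blocks.

    ASSUMPTION: square patches of `patch` pixels and merge x merge
    block merging; neither value is taken from the paper.
    """
    rows, cols = height // patch, width // patch
    before = rows * cols                         # one token per patch
    after = (rows // merge) * (cols // merge)    # one token per merged block
    return before, after


before, after = token_count(1024, 1024)
print(before, after, before / after)  # 4096 1024 4.0
```

Under these assumptions a 1024x1024 image drops from 4096 to 1024 tokens, so the LLM attends over a sequence one quarter as long.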

Abstract

A recent cutting-edge topic in multimodal modeling is to unify visual comprehension and generation within a single model. However, the two tasks demand mismatched decoding regimes and visual representations, making it non-trivial to jointly optimize within a shared feature space. In this work, we present Cheers, a unified multimodal model that decouples patch-level details from semantic representations, thereby stabilizing semantics for multimodal understanding and improving fidelity for image generation via gated detail residuals. Cheers includes three key components: (i) a unified vision tokenizer that encodes and compresses image latent states into semantic tokens for efficient LLM conditioning, (ii) an LLM-based Transformer that unifies autoregressive decoding for text generation and diffusion decoding for image generation, and (iii) a cascaded flow matching head that decodes visual…
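
The "gated detail residuals" mentioned in the abstract can be pictured with a small sketch. This is not the authors' implementation: it assumes a single scalar sigmoid gate that scales a patch-detail vector before adding it to a semantic vector, and the names `gated_detail_residual`, `semantic`, `detail`, and `gate_logit` are hypothetical.

```python
import math


def sigmoid(x):
    """Logistic function mapping a real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def gated_detail_residual(semantic, detail, gate_logit):
    """Add a gated share of patch detail to a semantic feature vector.

    ASSUMPTION: one scalar gate per vector; the paper may use a
    learned, per-channel, or per-token gate instead.
    """
    g = sigmoid(gate_logit)
    return [s + g * d for s, d in zip(semantic, detail)]


# A gate logit of 0 gives g = 0.5, so half of each detail value is added.
print(gated_detail_residual([1.0, 2.0], [0.5, -1.0], 0.0))  # [1.25, 1.5]
```

With a strongly negative gate logit the output stays close to the semantic vector, which is one way to read the abstract's claim that semantics remain stable for understanding while detail is added for generation.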

Code & Models

Models

🤗 ai9stars/Cheers
model · 175 dl · ♡ 25

Taxonomy

Topics: Generative Adversarial Networks and Image Synthesis · Multimodal Machine Learning Applications · Image Enhancement Techniques