UniCMs: A Unified Consistency Model For Efficient Multimodal Generation and Understanding

Chenkai Xu; Xu Wang; Zhenyi Liao; Yishun Li; Tianqi Hou; Zhijie Deng

arXiv:2502.05415 · cs.CV · May 20, 2025

TL;DR

UniCMs introduces a unified consistency model for multimodal generation and understanding, achieving superior performance and faster sampling in text-to-image and image-to-text tasks by combining discrete diffusion and autoregressive decoding.

Contribution

The paper proposes a novel unified consistency model that integrates discrete diffusion and autoregressive decoding for multimodal tasks, improving efficiency and performance.

Findings

1. Outperforms SD3 on GenEval, Image Reward, and CLIP Score.

2. Requires only 1/8 of SD3's sampling time.

3. Surpasses Show-o on the MMMU benchmark, with faster long-sequence generation.

Abstract

Consistency models (CMs) have shown promise in the efficient generation of both images and text. This raises the natural question of whether we can learn a unified CM for efficient multimodal generation (e.g., text-to-image) and understanding (e.g., image-to-text). Intuitively, such a model could be acquired by applying consistency distillation (CD) to existing unified multimodal models. However, the key challenge is establishing a unified denoising perspective for both image and text generation, which is essential for establishing the consistency mapping. To tackle this, at the representation level, we advocate for discrete tokens for both modalities to best preserve language modeling capabilities. Critically, instead of defining the text denoising trajectory via recent discrete diffusion language modeling principles, we specify it using the parallel decoding trace of an…
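The "parallel decoding trace" mentioned in the abstract can be pictured with a small toy. The sketch below assumes, based on the TL;DR's reference to autoregressive decoding, that the trace resembles Jacobi-style fixed-point decoding: all positions of a guessed sequence are refined in parallel until nothing changes. The function names and the `toy_next_token` rule are hypothetical stand-ins for illustration only; they are not taken from the paper or its repository.

```python
from typing import Callable, List, Sequence

Token = int


def toy_next_token(prefix: Sequence[Token], vocab_size: int = 7) -> Token:
    """Hypothetical deterministic stand-in for a model's greedy next-token choice."""
    return (sum(prefix) * 3 + len(prefix)) % vocab_size


def greedy_decode(
    prompt: Sequence[Token],
    n: int,
    next_token: Callable[[Sequence[Token]], Token] = toy_next_token,
) -> List[Token]:
    """Ordinary sequential decoding: one token per step."""
    seq = list(prompt)
    for _ in range(n):
        seq.append(next_token(seq))
    return seq[len(prompt):]


def jacobi_trace(
    prompt: Sequence[Token],
    init: Sequence[Token],
    next_token: Callable[[Sequence[Token]], Token] = toy_next_token,
) -> List[List[Token]]:
    """Refine every position in parallel until a fixed point is reached.

    Returns the whole trajectory, from the initial guess to the fixed point.
    """
    trace = [list(init)]
    while True:
        cur = trace[-1]
        # Position i is recomputed from the prompt plus the *previous* guess
        # for positions < i, so all positions can be updated at once.
        new = [next_token(list(prompt) + cur[:i]) for i in range(len(cur))]
        if new == cur:
            return trace
        trace.append(new)


if __name__ == "__main__":
    prompt = [1, 2]
    trace = jacobi_trace(prompt, init=[0] * 6)
    for step, state in enumerate(trace):
        print(step, state)
    print("matches greedy decoding:", trace[-1] == greedy_decode(prompt, 6))
```

In this toy, the fixed point coincides with the sequential greedy output, and at least one more leading token becomes correct per iteration, so the trace has at most n + 1 states for n tokens. Under the same assumption, every intermediate state paired with the final state would be one (partially decoded, fully decoded) example that a consistency mapping could be trained to collapse.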

Peer Reviews

Decision · ICLR 2026 Conference Withdrawn Submission

Reviewer 01 · Rating 4 · Confidence 4

Strengths

1. This paper is well-organized and easy to read and follow.
2. The overall motivation is clear and logical.
3. The distilled model achieves significant acceleration while preserving the multimodal understanding and generation performance.

Weaknesses

1. I thought it was necessary to study the acceleration in unified multimodal models. I was wondering if there are unique challenges in this area instead of directly adopting the distillation approaches well-studied in large language models and consistency models.
2. It looks like there are significant performance drops in several metrics in the T2I and MMU evaluations.
3. Is it possible to wrap the proposed pipeline into a general acceleration, as a lot of unified multimodal models have be

Reviewer 02 · Rating 2 · Confidence 3

Strengths

- The paper is well written and easy to follow.
- The numerical results are very good: their distilled version of Show-o is almost as capable as the original model while cutting the computational cost significantly.
- The experimental study is comprehensive, with good ablation studies.

Weaknesses

- The consistency loss for discrete diffusion is not very consistent. The distilled probability distributions would have to include the correlations between different tokens, however, in the current presentation, these are dropped, and it is unclear how the model is managing to actually distill things if tokens are being generated independently. A work that has studied this problem is [1] from which it can be seen the importance of learning such correlations. Without further explanation of this

Reviewer 03 · Rating 4 · Confidence 2

Strengths

1. Practical speedup: UniCMs deliver significant acceleration, which is valuable for real-world deployment.
2. Comprehensive evaluation: The experiments are thorough, with strong baselines and ablations.
3. Clarity: The paper is clearly written and well-illustrated.

Weaknesses

1. Limited Technical Contribution: The main advance is in speed, not in new algorithms or modeling paradigms. The technical novelty is incremental.
2. Performance not state-of-the-art: UniCMs do not outperform the best existing models on all benchmarks; there is a clear trade-off between speed and accuracy.
3. Initialization from Show-o: The framework is initialized with Show-o’s architecture and parameters, and the resulting performance is only comparable to Show-o, which further limits the ori

Code & Models

Repositories

zhijie-group/unicms · jax · Official

Taxonomy

Topics: Natural Language Processing Techniques · Speech and dialogue systems · Topic Modeling

Methods: Diffusion · Contrastive Language-Image Pre-training · Consistency Models