UniCMs: A Unified Consistency Model For Efficient Multimodal Generation and Understanding
Chenkai Xu, Xu Wang, Zhenyi Liao, Yishun Li, Tianqi Hou, Zhijie Deng

TL;DR
UniCMs introduces a unified consistency model for multimodal generation and understanding that combines discrete diffusion and autoregressive decoding. It reports better scores and faster sampling than SD3 on text-to-image tasks and than Show-o on image-to-text tasks (MMMU).
Contribution
The paper proposes a novel unified consistency model that integrates discrete diffusion and autoregressive decoding for multimodal tasks, improving efficiency and performance.
Findings
Outperforms SD3 on GenEval, Image Reward, and CLIP Score
Requires only 1/8 of SD3's sampling time
Surpasses Show-o on the MMMU benchmark with faster long-sequence generation
Abstract
Consistency models (CMs) have shown promise in the efficient generation of both image and text. This raises the natural question of whether we can learn a unified CM for efficient multimodal generation (e.g., text-to-image) and understanding (e.g., image-to-text). Intuitively, such a model could be acquired by applying the consistency distillation (CD) to existing unified multimodal models. However, the key challenge is establishing a unified denoising perspective for both image and text generation, which is essential for establishing the consistency mapping. To tackle this, at the representation level, we advocate for discrete tokens for both modalities to best preserve language modeling capabilities. Critically, instead of defining the text denoising trajectory via recent discrete diffusion language modeling principles, we specify it using the parallel decoding trace of an…
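The abstract is cut off before it says how the parallel decoding trace is obtained, so the sketch below only illustrates what such a trace can look like. It assumes Jacobi (fixed-point) decoding, a well-known parallel decoding scheme chosen here for illustration, and a hypothetical toy next-token function (`toy_next_token`) in place of a real model. It is not the authors' implementation.

```python
from typing import Callable, List, Sequence

MASK = 0  # hypothetical placeholder token used to initialise the draft


def toy_next_token(prefix: Sequence[int]) -> int:
    """Hypothetical stand-in for a greedy autoregressive model (returns 1..50)."""
    return (sum(prefix) * 31 + len(prefix) * 7 + 3) % 50 + 1


def greedy_decode(prompt: Sequence[int], n_new: int,
                  next_token: Callable[[Sequence[int]], int] = toy_next_token) -> List[int]:
    """Ordinary left-to-right decoding: one token per model call."""
    seq = list(prompt)
    for _ in range(n_new):
        seq.append(next_token(seq))
    return seq[len(prompt):]


def jacobi_trace(prompt: Sequence[int], n_new: int,
                 next_token: Callable[[Sequence[int]], int] = toy_next_token) -> List[List[int]]:
    """Parallel (Jacobi) decoding: refine all n_new positions at once.

    Returns every intermediate draft, from the all-MASK start to the fixed point.
    """
    draft = [MASK] * n_new
    trace = [list(draft)]
    for _ in range(n_new):  # at most n_new sweeps are needed to converge
        # Every position is recomputed from the PREVIOUS draft, so the
        # n_new predictions in one sweep are independent of each other.
        new = [next_token(list(prompt) + draft[:i]) for i in range(n_new)]
        if new == draft:  # fixed point reached early
            break
        trace.append(new)
        draft = new
    return trace


if __name__ == "__main__":
    for step, state in enumerate(jacobi_trace([5, 9], 6)):
        print(step, state)
```

Each intermediate draft in the trace plays a role similar to a noisy sample on a diffusion trajectory, and the final entry equals the greedy autoregressive output. A consistency model, by definition, is trained to map any point on such a trajectory directly to its endpoint, which is what would allow decoding in fewer sweeps.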
Peer Reviews
Decision: ICLR 2026 Conference Withdrawn Submission
1. This paper is well-organized and easy to read and follow.
2. The overall motivation is clear and logical.
3. The distilled model achieves significant acceleration while preserving the multimodal understanding and generation performance.

1. I thought it was necessary to study the acceleration in unified multimodal models. I was wondering if there are unique challenges in this area instead of directly adopting the distillation approaches well-studied in large language models and consistency models.
2. It looks like there are significant performance drops in several metrics in the T2I and MMU evaluations.
3. Is it possible to wrap the proposed pipeline into a general acceleration, as a lot of unified multimodal models have be
- The paper is well written and easy to follow.
- The numerical results are very good: their distilled version of Show-o is almost as capable as the original model while cutting down the computational cost significantly.
- The experimental study is comprehensive, with good ablation studies.
- The consistency loss for discrete diffusion is not very consistent. The distilled probability distributions would have to include the correlations between different tokens, however, in the current presentation, these are dropped, and it is unclear how the model is managing to actually distill things if tokens are being generated independently. A work that has studied this problem is [1] from which it can be seen the importance of learning such correlations. Without further explanation of this
1. Practical speedup: UniCMs deliver significant acceleration, which is valuable for real-world deployment.
2. Comprehensive evaluation: The experiments are thorough, with strong baselines and ablations.
3. Clarity: The paper is clearly written and well-illustrated.

1. Limited Technical Contribution: The main advance is in speed, not in new algorithms or modeling paradigms. The technical novelty is incremental.
2. Performance not state-of-the-art: UniCMs do not outperform the best existing models on all benchmarks; there is a clear trade-off between speed and accuracy.
3. Initialization from Show-o: The framework is initialized with Show-o’s architecture and parameters, and the resulting performance is only comparable to Show-o, which further limits the ori
Taxonomy
Topics: Natural Language Processing Techniques · Speech and dialogue systems · Topic Modeling
Methods: Diffusion · Contrastive Language-Image Pre-training · Consistency Models
