Show-o: One Single Transformer to Unify Multimodal Understanding and Generation
Jinheng Xie, Weijia Mao, Zechen Bai, David Junhao Zhang, Weihao Wang, Kevin Qinghong Lin, Yuchao Gu, Zhijie Chen, Zhenheng Yang, Mike Zheng Shou

TL;DR
Show-o is a unified transformer model that combines autoregressive and diffusion techniques to handle diverse multimodal understanding and generation tasks, achieving competitive performance across multiple benchmarks.
Contribution
It introduces a novel unified transformer architecture that seamlessly integrates autoregressive and diffusion modeling for versatile multimodal tasks.
Findings
Achieves comparable or superior performance to specialized models on various benchmarks.
Supports a wide range of vision-language tasks including VQA, text-to-image, and inpainting.
Demonstrates the potential of a single model as a next-generation foundation for multimodal AI.
Abstract
We present a unified transformer, i.e., Show-o, that unifies multimodal understanding and generation. Unlike fully autoregressive models, Show-o unifies autoregressive and (discrete) diffusion modeling to adaptively handle inputs and outputs of various and mixed modalities. The unified model flexibly supports a wide range of vision-language tasks including visual question-answering, text-to-image generation, text-guided inpainting/extrapolation, and mixed-modality generation. Across various benchmarks, it demonstrates comparable or superior performance to existing individual models with an equivalent or larger number of parameters tailored for understanding or generation. This significantly highlights its potential as a next-generation foundation model. Code and models are released at https://github.com/showlab/Show-o.
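To make the phrase "unifies autoregressive and (discrete) diffusion modeling" concrete, here is a minimal sketch of how training targets for the two modalities might be prepared. It assumes text tokens are trained with next-token prediction and image tokens with mask-and-reconstruct corruption. The names `MASK_ID`, `next_token_targets`, and `mask_image_tokens`, the token ids, and the masking ratio are all hypothetical illustrations and are not taken from the Show-o code base.

```python
import random

MASK_ID = -1  # hypothetical id for the mask token (assumption, not from the paper)


def next_token_targets(text_ids):
    """Autoregressive part: each position is asked to predict the following token."""
    inputs = text_ids[:-1]
    targets = text_ids[1:]
    return inputs, targets


def mask_image_tokens(image_ids, mask_ratio, rng):
    """Discrete-diffusion-style part: replace a random subset of tokens with MASK_ID.

    Returns the corrupted sequence and a dict {position: original_id} holding
    the tokens a model would be asked to reconstruct.
    """
    n_mask = max(1, round(mask_ratio * len(image_ids)))
    positions = sorted(rng.sample(range(len(image_ids)), n_mask))
    corrupted = list(image_ids)
    targets = {}
    for p in positions:
        targets[p] = corrupted[p]
        corrupted[p] = MASK_ID
    return corrupted, targets


rng = random.Random(0)
text_ids = [11, 12, 13, 14]                            # toy text token ids
image_ids = [101, 102, 103, 104, 105, 106, 107, 108]   # toy image token ids

ar_inputs, ar_targets = next_token_targets(text_ids)
corrupted, md_targets = mask_image_tokens(image_ids, 0.5, rng)

print("AR inputs/targets:", ar_inputs, ar_targets)
print("Corrupted image tokens:", corrupted)
print("Positions to reconstruct:", md_targets)
```

In this toy setup a single sequence model would receive both kinds of targets: a shifted copy of the text and the original values at the masked image positions. How Show-o actually schedules masking and weights the two losses is not stated in this summary.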
Peer Reviews
Decision: ICLR 2025 Poster
1. The writing is good and clear.
2. The paper presents a compelling study of Show-o, a unified MLLM, and compares it with various other MLLMs, showcasing its promise in both multimodal understanding and generation tasks.
3. A wide array of experiments is conducted to showcase the effectiveness of Show-o.
1. This work lacks comparisons with other unified models like VILA-U, Transfusion, and Emu3...
2. Show-o appears to use special tokens such as [mmu] for multimodal understanding and [T2I] for visual generation to handle different tasks separately. This design raises questions about its necessity, as it explicitly reveals task information. Should a new special token be introduced for each new task?
3. I'm curious about Show-o's ability to manage interleaved input and output.
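The concern about task tokens can be illustrated with a small sketch. Only the token strings [mmu] and [T2I] come from the review; the registry `TASK_TOKENS`, the function `build_prompt`, and the ordering of image and text tokens are hypothetical assumptions made for illustration.

```python
# Hypothetical registry of task tokens; only the two token strings
# are mentioned in the review, everything else is an assumption.
TASK_TOKENS = {"mmu": "[mmu]", "t2i": "[T2I]"}


def build_prompt(task, text_tokens, image_tokens):
    """Prefix a sequence with its task token.

    Assumed ordering: image then text for understanding ("mmu"),
    text then image for generation ("t2i").
    """
    if task not in TASK_TOKENS:
        raise KeyError(f"no task token registered for task {task!r}")
    prefix = [TASK_TOKENS[task]]
    if task == "mmu":
        return prefix + list(image_tokens) + list(text_tokens)
    return prefix + list(text_tokens) + list(image_tokens)


question = ["what", "is", "shown", "?"]
image = ["<img0>", "<img1>"]
print(build_prompt("mmu", question, image))
print(build_prompt("t2i", ["a", "red", "car"], image))
```

Under this scheme, supporting a new task means registering another entry in the table, which is the scaling question the reviewer raises.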
- Solid experiments and comprehensive comparison with existing works.
- Investigates a new paradigm of combining autoregressive and discrete diffusion modeling for unifying understanding and generation.
- Extensive ablations for scaling up data, resolution, and visual representation (discrete/continuous and different pretraining).
- The paper is well-written and easy to follow.
- The model incurs significant training costs, making it challenging to scale, with potentially slower inference speeds.
- The benefits of unifying understanding and generation in a single model are unclear, especially given its lower performance in both areas compared to specialized unimodal models.
- The paper compares against many existing baselines on many benchmarks, indicating a dense evaluation.
- The results are strong and competitive on image-text understanding and generation tasks across strong baselines.
- Good presentation and writing; the many well-made figures make the paper easy to understand.
- The extension to videos and mixed modality makes the paper more applicable.
- The paper doesn't do a good job of convincing the reader why one should use discrete diffusion on images instead of AR, as in Chameleon. While the introduction motivates the use of diffusion on images by discussing the number of forward evals, there are no empirical experiments showcasing this benefit.
- The comparisons with baselines are difficult to assess due to the use of different training/evaluation sets for each of the baselines. A human eval, with human deciding the starting prompt might make the
Taxonomy
Topics: Speech and dialogue systems
Methods: Diffusion
