Show-o2: Improved Native Unified Multimodal Models

Jinheng Xie; Zhenheng Yang; Mike Zheng Shou

arXiv:2506.15564 · cs.CV · September 23, 2025

TL;DR

Show-o2 is a family of native unified multimodal models that combine autoregressive modeling and flow matching on top of a 3D causal variational autoencoder space, supporting multimodal understanding and generation across text, images, and videos.

Contribution

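The paper presents a unified multimodal modeling approach in which a language model applies autoregressive modeling to a language head for text token prediction and flow matching to a flow head for image/video generation. Unified visual representations are built in a 3D causal variational autoencoder space, and a two-stage training recipe is used to scale to larger models.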

Findings

01

Demonstrates effective multimodal understanding and generation across diverse modalities.

02

Achieves scalability to larger models with a two-stage training process.

03

Models show versatility in handling text, images, and videos.

Abstract

This paper presents improved native unified multimodal models, i.e., Show-o2, that leverage autoregressive modeling and flow matching. Built upon a 3D causal variational autoencoder space, unified visual representations are constructed through a dual-path of spatial (-temporal) fusion, enabling scalability across image and video modalities while ensuring effective multimodal understanding and generation. Based on a language model, autoregressive modeling and flow matching are natively applied to the language head and flow head, respectively, to facilitate text token prediction and image/video generation. A two-stage training recipe is designed to effectively learn and scale to larger models. The resulting Show-o2 models demonstrate versatility in handling a wide range of multimodal understanding and generation tasks across diverse modalities, including text, images, and videos.…
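To make the two objectives named in the abstract concrete, here is a minimal sketch of how a next-token prediction loss (language head) and a flow matching loss (flow head) might be combined. It is not the authors' implementation. The linear interpolation path, the velocity target, the weighting factor `alpha`, and all function names are assumptions chosen for illustration, and plain Python lists stand in for latent tensors.

```python
import math


def next_token_loss(logits, target_index):
    """Cross-entropy for one position: -log softmax(logits)[target_index]."""
    m = max(logits)  # subtract the max for numerical stability
    log_z = m + math.log(sum(math.exp(v - m) for v in logits))
    return log_z - logits[target_index]


def flow_matching_pair(x1, noise, t):
    """Build a training pair on a linear path (an assumed, common choice).

    x_t = (1 - t) * noise + t * x1, with velocity target x1 - noise.
    """
    x_t = [(1.0 - t) * n + t * x for n, x in zip(noise, x1)]
    v_target = [x - n for n, x in zip(noise, x1)]
    return x_t, v_target


def flow_matching_loss(v_pred, v_target):
    """Mean squared error between predicted and target velocity."""
    return sum((p - q) ** 2 for p, q in zip(v_pred, v_target)) / len(v_target)


def combined_loss(ntp, fm, alpha=1.0):
    """Weighted sum of both objectives; alpha is a hypothetical weight."""
    return alpha * ntp + fm


# Usage: one text position and one tiny "latent" of two values.
ntp = next_token_loss([2.0, 0.5, -1.0], target_index=0)
x_t, v_target = flow_matching_pair(x1=[1.0, 2.0], noise=[0.0, 0.0], t=0.5)
fm = flow_matching_loss(v_pred=[0.9, 2.1], v_target=v_target)
total = combined_loss(ntp, fm, alpha=1.0)
```

In a real model, `logits` would come from the language head and `v_pred` from the flow head evaluated at `x_t` and `t`; here both are supplied by hand so the arithmetic can be followed.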

Code & Models

Repositories

showlab/show-o · jax · Official

Taxonomy

Topics: Speech and dialogue systems · Natural Language Processing Techniques