Show-o2: Improved Native Unified Multimodal Models
Jinheng Xie, Zhenheng Yang, Mike Zheng Shou

TL;DR
Show-o2 is a family of native unified multimodal models that combines autoregressive modeling for text with flow matching for image and video generation. The models operate on visual representations built in a 3D causal variational autoencoder (VAE) space and support multimodal understanding and generation across text, images, and videos.
Contribution
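The paper presents a unified multimodal modeling approach built on a language model: autoregressive modeling is applied to a language head for text token prediction, and flow matching is applied to a flow head for image/video generation. Unified visual representations are constructed in a 3D causal VAE space through a dual path of spatial(-temporal) fusion, so the same design covers both images and videos.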
Findings
The resulting Show-o2 models handle a wide range of multimodal understanding and generation tasks across text, images, and videos.
A two-stage training recipe is used to learn effectively and to scale to larger models.
Unified visual representations in the 3D causal VAE space let one model cover both image and video modalities.
Abstract
This paper presents improved native unified multimodal models, i.e., Show-o2, that leverage autoregressive modeling and flow matching. Built upon a 3D causal variational autoencoder space, unified visual representations are constructed through a dual path of spatial(-temporal) fusion, enabling scalability across image and video modalities while ensuring effective multimodal understanding and generation. Based on a language model, autoregressive modeling and flow matching are natively applied to the language head and flow head, respectively, to facilitate text token prediction and image/video generation. A two-stage training recipe is designed to effectively learn and scale to larger models. The resulting Show-o2 models demonstrate versatility in handling a wide range of multimodal understanding and generation tasks across diverse modalities, including text, images, and videos.…
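The abstract pairs two objectives: next-token prediction on the language head and flow matching on the flow head. The following is a minimal NumPy sketch of how such a pair of losses might be computed, not the authors' implementation. The straight-line noise-to-data path, the weighting factor `alpha`, and the function names `next_token_loss`, `flow_matching_loss`, and `combined_loss` are all assumptions introduced for illustration.

```python
import numpy as np


def next_token_loss(logits, targets):
    """Mean cross-entropy for next-token prediction.

    logits: (N, V) unnormalized scores; targets: (N,) integer token ids.
    """
    shifted = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return float(-np.mean(log_probs[np.arange(len(targets)), targets]))


def flow_matching_loss(x1, velocity_fn, rng):
    """Flow matching loss on a straight noise-to-data path (an assumed path choice).

    x1: (N, D) clean latents; velocity_fn(xt, t) returns a predicted velocity of shape (N, D).
    """
    x0 = rng.standard_normal(x1.shape)       # Gaussian noise sample
    t = rng.uniform(size=(x1.shape[0], 1))   # one time step per example
    xt = (1.0 - t) * x0 + t * x1             # point on the straight path
    target = x1 - x0                         # velocity of that path
    pred = velocity_fn(xt, t)
    return float(np.mean((pred - target) ** 2))


def combined_loss(ntp, fm, alpha=1.0):
    """Weighted sum of the two objectives; alpha is a hypothetical weight."""
    return alpha * ntp + fm


rng = np.random.default_rng(0)
latents = rng.standard_normal((8, 16))       # stand-in for VAE latents
logits = np.zeros((4, 5))                    # uniform predictions over 5 tokens
tokens = np.array([0, 1, 2, 3])

ntp = next_token_loss(logits, tokens)
fm = flow_matching_loss(latents, lambda xt, t: np.zeros_like(xt), rng)
total = combined_loss(ntp, fm, alpha=0.5)
```

Here the velocity predictor is a placeholder that always returns zeros; in a real model it would be the flow head conditioned on the language model's hidden states.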
Code & Models
- 🤗 showlab/show-o · model · 69 dl · ♡ 17
- 🤗 showlab/show-o-w-clip-vit · model · 42 dl · ♡ 2
- 🤗 showlab/show-o-512x512-wo-llava-tuning · model · 11 dl · ♡ 1
- 🤗 showlab/show-o2-1.5B · model · 156 dl · ♡ 7
- 🤗 showlab/show-o2-7B · model · 853 dl · ♡ 15
- 🤗 showlab/show-o2-1.5B-HQ · model · 63 dl · ♡ 3
- 🤗 showlab/show-o2-1.5B-w-video-und · model · 11 dl
- 🤗 showlab/show-o2-7B-w-video-und · model · 4 dl · ♡ 2
Taxonomy
Topics: Speech and dialogue systems · Natural Language Processing Techniques
