BLIP3-o: A Family of Fully Open Unified Multimodal Models-Architecture, Training and Dataset
Jiuhai Chen, Zhiyang Xu, Xichen Pan, Yushi Hu, Can Qin, Tom Goldstein, Lifu Huang, Tianyi Zhou, Saining Xie, Silvio Savarese, Le Xue, Caiming Xiong, Ran Xu

TL;DR
This paper introduces BLIP3-o, a unified multimodal model that combines image understanding and generation using a diffusion transformer and a sequential training strategy, achieving state-of-the-art results and open-sourcing all resources.
Contribution
The paper presents a novel diffusion transformer architecture, a sequential pretraining approach, and a curated dataset for unified multimodal modeling, advancing both image understanding and generation capabilities.
Findings
Using a diffusion transformer to generate CLIP image features yields higher training efficiency and better generative quality than conventional VAE-based representations.
Sequential pretraining preserves understanding while enhancing generation.
State-of-the-art performance on multiple benchmarks.
Abstract
Unifying image understanding and generation has gained growing attention in recent research on multimodal models. Although design choices for image understanding have been extensively studied, the optimal model architecture and training recipe for a unified framework with image generation remain underexplored. Motivated by the strong potential of autoregressive and diffusion models for high-quality generation and scalability, we conduct a comprehensive study of their use in unified multimodal settings, with emphasis on image representations, modeling objectives, and training strategies. Grounded in these investigations, we introduce a novel approach that employs a diffusion transformer to generate semantically rich CLIP image features, in contrast to conventional VAE-based representations. This design yields both higher training efficiency and improved generative quality. Furthermore,…
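As a rough illustration of what it means to run diffusion on CLIP image features instead of VAE latents, the sketch below noises a feature vector and computes a noise-prediction loss. The abstract excerpt does not state the exact objective, so this assumes a generic DDPM-style forward process with a linear beta schedule. The names `alpha_bar`, `add_noise`, and `noise_prediction_loss`, the 8-dimensional stand-in feature, and the zero predictor are all hypothetical and are not taken from the BLIP3-o code.

```python
import math
import random

def alpha_bar(t, num_steps=1000, beta_start=1e-4, beta_end=0.02):
    """Cumulative product of (1 - beta_s) for s = 0..t, linear schedule (assumed)."""
    prod = 1.0
    for s in range(t + 1):
        beta = beta_start + (beta_end - beta_start) * s / (num_steps - 1)
        prod *= 1.0 - beta
    return prod

def add_noise(feature, t, rng):
    """Forward process: x_t = sqrt(a_bar) * x_0 + sqrt(1 - a_bar) * eps."""
    a_bar = alpha_bar(t)
    eps = [rng.gauss(0.0, 1.0) for _ in feature]
    noisy = [math.sqrt(a_bar) * x + math.sqrt(1.0 - a_bar) * e
             for x, e in zip(feature, eps)]
    return noisy, eps

def noise_prediction_loss(predicted_eps, true_eps):
    """Mean squared error between predicted and true noise."""
    return sum((p - e) ** 2 for p, e in zip(predicted_eps, true_eps)) / len(true_eps)

rng = random.Random(0)
# Stand-in for a CLIP image feature; real features are much higher-dimensional.
clip_feature = [rng.gauss(0.0, 1.0) for _ in range(8)]
noisy_feature, eps = add_noise(clip_feature, t=500, rng=rng)

# A trained diffusion transformer would predict eps from the noisy feature,
# the timestep, and the conditioning signal. A zero predictor stands in here.
loss = noise_prediction_loss([0.0] * len(eps), eps)
```

In this framing the denoising target is a semantic feature vector, so the generator works in the same representation space that the understanding side of the model already uses.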
Taxonomy
Topics: Natural Language Processing Techniques · Speech and dialogue systems · Semantic Web and Ontologies
Methods: Softmax · Attention Is All You Need · Diffusion · Sparse Evolutionary Training · Contrastive Language-Image Pre-training
