BLIP3-o: A Family of Fully Open Unified Multimodal Models-Architecture, Training and Dataset

Jiuhai Chen; Zhiyang Xu; Xichen Pan; Yushi Hu; Can Qin; Tom Goldstein; Lifu Huang; Tianyi Zhou; Saining Xie; Silvio Savarese; Le Xue; Caiming Xiong; Ran Xu

arXiv:2505.09568 · cs.CV · May 15, 2025

TL;DR

This paper introduces BLIP3-o, a family of unified multimodal models for image understanding and generation. It uses a diffusion transformer to generate CLIP image features, trains with a sequential strategy, reports state-of-the-art results, and open-sources all resources.

Contribution

The paper presents an approach in which a diffusion transformer generates CLIP image features, a sequential pretraining strategy, and a curated dataset for unified multimodal modeling, advancing both image understanding and image generation.

Findings

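01. Generating CLIP image features with a diffusion transformer, rather than conventional VAE-based representations, yields higher training efficiency and improved generative quality.

02. Sequential pretraining preserves image understanding while enhancing image generation.

03. The model reaches state-of-the-art performance on multiple benchmarks.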

Abstract

Unifying image understanding and generation has gained growing attention in recent research on multimodal models. Although design choices for image understanding have been extensively studied, the optimal model architecture and training recipe for a unified framework with image generation remain underexplored. Motivated by the strong potential of autoregressive and diffusion models for high-quality generation and scalability, we conduct a comprehensive study of their use in unified multimodal settings, with emphasis on image representations, modeling objectives, and training strategies. Grounded in these investigations, we introduce a novel approach that employs a diffusion transformer to generate semantically rich CLIP image features, in contrast to conventional VAE-based representations. This design yields both higher training efficiency and improved generative quality. Furthermore,…
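The abstract text shown here does not spell out the training objective. As an illustration only, the sketch below assumes a flow-matching-style (velocity prediction) objective applied to feature vectors, which is one common way to train a diffusion transformer on continuous targets. The 4-dimensional feature, the function names, and the zero-velocity "model" are all hypothetical stand-ins, not the authors' implementation.

```python
import random

def flow_matching_pair(clean_feat, t, rng):
    """Build one training example for a velocity-prediction objective.

    clean_feat: target feature vector (a stand-in for a CLIP image feature).
    t: time in [0, 1]; t=0 is pure noise, t=1 is the clean feature.
    """
    noise = [rng.gauss(0.0, 1.0) for _ in clean_feat]
    # Linear interpolation between the noise sample and the clean feature.
    noisy_feat = [(1.0 - t) * n + t * x for n, x in zip(noise, clean_feat)]
    # Along this straight path the velocity is constant: clean minus noise.
    target_velocity = [x - n for n, x in zip(noise, clean_feat)]
    return noisy_feat, target_velocity

def mse(pred, target):
    """Mean squared error between two equal-length vectors."""
    return sum((p - q) ** 2 for p, q in zip(pred, target)) / len(target)

rng = random.Random(0)
clean = [0.5, -1.0, 2.0, 0.0]  # hypothetical 4-d feature; real features are much larger
t = 0.3
noisy, target = flow_matching_pair(clean, t, rng)

# Hypothetical model output (all zeros), used only to show how the loss is computed.
pred = [0.0] * len(clean)
loss = mse(pred, target)
```

In a real system the prediction would come from the diffusion transformer, conditioned on the noisy feature, the time step, and whatever context the unified model supplies; the loss would then be minimized over many sampled pairs.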

Code & Models

Repositories

jiuhaichen/blip3o (PyTorch, official)

Taxonomy

Topics: Natural Language Processing Techniques · Speech and dialogue systems · Semantic Web and Ontologies

Methods: Softmax · Attention Is All You Need · Diffusion · Sparse Evolutionary Training · Contrastive Language-Image Pre-training