OpenUni: A Simple Baseline for Unified Multimodal Understanding and Generation

Size Wu; Zhonghua Wu; Zerui Gong; Qingyi Tao; Sheng Jin; Qinyue Li; Wei Li; Chen Change Loy

arXiv:2505.23661·cs.CV·June 3, 2025

TL;DR

OpenUni introduces a lightweight, open-source baseline that unifies multimodal understanding and generation, achieving high-quality image synthesis and strong benchmark performance with minimal complexity.

Contribution

It presents a simple, efficient architecture bridging multimodal LLMs and diffusion models, with released code, weights, and datasets to foster open research.

Findings

1. High-quality, instruction-aligned image generation.

2. Strong results on GenEval, DPG-Bench, and WISE with only 1.1B and 3.1B activated parameters.

3. Open-source release of model weights, training code, and curated training datasets.

Abstract

In this report, we present OpenUni, a simple, lightweight, and fully open-source baseline for unifying multimodal understanding and generation. Inspired by prevailing practices in unified model learning, we adopt an efficient training strategy that minimizes the training complexity and overhead by bridging the off-the-shelf multimodal large language models (LLMs) and diffusion models through a set of learnable queries and a light-weight transformer-based connector. With a minimalist choice of architecture, we demonstrate that OpenUni can: 1) generate high-quality and instruction-aligned images, and 2) achieve exceptional performance on standard benchmarks such as GenEval, DPG-Bench, and WISE, with only 1.1B and 3.1B activated parameters. To support open research and community advancement, we release all model weights, training code, and our curated training datasets (including 23M…
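
The bridging idea in the abstract (learnable queries plus a light-weight transformer connector between a multimodal LLM and a diffusion model) can be illustrated with a minimal PyTorch sketch. This is not the released OpenUni code: the class name `QueryConnector`, all dimensions, the small stand-in `llm`, the choice to freeze it, and the dummy loss are assumptions made for illustration only. The sketch shows just the data flow: queries are appended to the prompt embeddings, read back from the LLM output, and mapped by the connector to a conditioning tensor that a diffusion model could consume.

```python
import torch
import torch.nn as nn


class QueryConnector(nn.Module):
    """Hypothetical bridge: learnable queries + a small transformer connector."""

    def __init__(self, llm_dim=64, cond_dim=32, num_queries=8, depth=2, heads=4):
        super().__init__()
        # Learnable query embeddings that get appended to the prompt sequence.
        self.queries = nn.Parameter(torch.randn(num_queries, llm_dim) * 0.02)
        layer = nn.TransformerEncoderLayer(
            d_model=llm_dim, nhead=heads, dim_feedforward=4 * llm_dim, batch_first=True
        )
        # Light-weight transformer that refines the query hidden states.
        self.connector = nn.TransformerEncoder(layer, num_layers=depth)
        # Projection to the (assumed) conditioning width of the diffusion model.
        self.proj = nn.Linear(llm_dim, cond_dim)

    def forward(self, llm, prompt_embeds):
        batch = prompt_embeds.shape[0]
        num_queries = self.queries.shape[0]
        q = self.queries.unsqueeze(0).expand(batch, -1, -1)
        # Run the LLM over [prompt tokens ; query tokens].
        hidden = llm(torch.cat([prompt_embeds, q], dim=1))
        # Keep only the hidden states at the query positions.
        query_states = hidden[:, -num_queries:]
        return self.proj(self.connector(query_states))


torch.manual_seed(0)

# Stand-in for an off-the-shelf multimodal LLM (assumption: a tiny, bidirectional
# encoder replaces the real model, and its weights are kept frozen).
llm = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=64, nhead=4, dim_feedforward=128, batch_first=True),
    num_layers=2,
)
llm.requires_grad_(False)
llm.eval()

bridge = QueryConnector()
prompt_embeds = torch.randn(2, 5, 64)  # (batch, prompt_len, llm_dim), random toy input
cond = bridge(llm, prompt_embeds)      # (batch, num_queries, cond_dim)

# Dummy loss standing in for a diffusion training objective.
cond.pow(2).mean().backward()
print(cond.shape)
```

In this toy setup only the queries, connector, and projection receive gradients, while the stand-in LLM stays untouched. That is one way to keep the number of trained components small, in the spirit of the efficient training strategy the abstract describes.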

Code & Models

Repositories

wusize/openuni (PyTorch, official)

Taxonomy

Topics: Multimodal Machine Learning Applications · Topic Modeling · Domain Adaptation and Few-Shot Learning

Methods: ADaptive gradient method with the OPTimal convergence rate · Diffusion · Sparse Evolutionary Training