Ovis-U1 Technical Report

Guo-Hua Wang; Shanshan Zhao; Xinjie Zhang; Liangfu Cao; Pengxin Zhan; Lunhao Duan; Shiyin Lu; Minghao Fu; Xiaohao Chen; Jianshan Zhao; Yang Li; Qing-Guo Chen

arXiv:2506.23044·cs.CV·July 2, 2025

TL;DR

Ovis-U1 is a 3-billion-parameter unified multimodal model that integrates multimodal understanding, text-to-image generation, and image editing in a single system, and reports strong benchmark results on all three tasks.

Contribution

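Ovis-U1 introduces a unified training approach that starts from a language model rather than relying on a frozen MLLM for generation. The report finds that training on understanding and generation together yields better performance than training on either task alone.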

Findings

01. Achieves 69.6 on the OpenCompass Multi-modal Academic Benchmark, surpassing recent models such as Ristretto-3B and SAIL-VL-1.5-2B.

02. Scores 83.72 on DPG-Bench for text-to-image generation.

03. Attains 6.42 on GEdit-Bench for image editing.

Abstract

In this report, we introduce Ovis-U1, a 3-billion-parameter unified model that integrates multimodal understanding, text-to-image generation, and image editing capabilities. Building on the foundation of the Ovis series, Ovis-U1 incorporates a diffusion-based visual decoder paired with a bidirectional token refiner, enabling image generation tasks comparable to leading models like GPT-4o. Unlike some previous models that use a frozen MLLM for generation tasks, Ovis-U1 utilizes a new unified training approach starting from a language model. Compared to training solely on understanding or generation tasks, unified training yields better performance, demonstrating the enhancement achieved by integrating these two tasks. Ovis-U1 achieves a score of 69.6 on the OpenCompass Multi-modal Academic Benchmark, surpassing recent state-of-the-art models such as Ristretto-3B and SAIL-VL-1.5-2B. In…

Code & Models

Repositories

aidc-ai/ovis-u1 (PyTorch, official)

Taxonomy

Topics: Multimodal Machine Learning Applications · Generative Adversarial Networks and Image Synthesis · Digital Humanities and Scholarship