Ovis-Image Technical Report

Guo-Hua Wang; Liangfu Cao; Tianyu Cui; Minghao Fu; Xiaohao Chen; Pengxin Zhan; Jianshan Zhao; Lan Li; Bowen Fu; Jiaqi Liu; Qing-Guo Chen

arXiv:2511.22982·cs.CV·December 1, 2025



TL;DR

Ovis-Image is a compact 7B text-to-image model optimized for high-quality text rendering. It achieves text rendering performance comparable to significantly larger open models such as Qwen-Image while remaining deployable on a single high-end GPU with moderate memory.

Contribution

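The paper introduces Ovis-Image, an efficient 7B text-to-image model that combines a diffusion-based visual decoder with the Ovis 2.5 multimodal backbone and a text-centric training pipeline (large-scale pre-training followed by tailored post-training) for high-quality text rendering.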

Findings

01. Achieves text rendering performance on par with significantly larger open models such as Qwen-Image.

02. Remains deployable on a single high-end GPU with moderate memory.

03. Demonstrates effective bilingual text rendering without resorting to an oversized model.

Abstract

We introduce Ovis-Image, a 7B text-to-image model specifically optimized for high-quality text rendering, designed to operate efficiently under stringent computational constraints. Built upon our previous Ovis-U1 framework, Ovis-Image integrates a diffusion-based visual decoder with the stronger Ovis 2.5 multimodal backbone, leveraging a text-centric training pipeline that combines large-scale pre-training with carefully tailored post-training refinements. Despite its compact architecture, Ovis-Image achieves text rendering performance on par with significantly larger open models such as Qwen-Image and approaches closed-source systems like Seedream and GPT4o. Crucially, the model remains deployable on a single high-end GPU with moderate memory, narrowing the gap between frontier-level text rendering and practical deployment. Our results indicate that combining a strong…
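
To give a rough sense of what "a single high-end GPU with moderate memory" can mean for a 7B model, the sketch below estimates the memory needed for the weights alone at a few common precisions. This is back-of-the-envelope arithmetic, not a figure from the report: the precisions listed, the helper name `weight_memory_gib`, and the assumption that all 7B parameters are stored at one precision are illustrative choices. Activations, intermediate latents, and framework overhead are not counted.

```python
# Weights-only memory estimate for a model with a given parameter count.
# Assumption (not from the report): every parameter is stored at the same precision.
BYTES_PER_PARAM = {"fp32": 4, "bf16": 2, "int8": 1}


def weight_memory_gib(num_params: float, dtype: str) -> float:
    """Return the memory in GiB needed to hold num_params weights in dtype."""
    return num_params * BYTES_PER_PARAM[dtype] / 2**30


if __name__ == "__main__":
    # 7e9 stands in for the "7B" parameter count quoted in the abstract.
    for dtype in BYTES_PER_PARAM:
        print(f"{dtype}: {weight_memory_gib(7e9, dtype):.1f} GiB")
```

Under these assumptions, half-precision weights occupy roughly 13 GiB, which leaves headroom for activations on a GPU with a few tens of gigabytes of memory. Actual requirements depend on resolution, batch size, and the inference stack.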


Code & Models

Models

AIDC-AI/Ovis-Image-7B (Hugging Face model; 612 downloads, 205 likes)


Taxonomy

Topics: Generative Adversarial Networks and Image Synthesis · Computer Graphics and Visualization Techniques · Multimodal Machine Learning Applications