UniWorld-V1: High-Resolution Semantic Encoders for Unified Visual Understanding and Generation

Bin Lin; Zongjian Li; Xinhua Cheng; Yuwei Niu; Yang Ye; Xianyi He; Shenghai Yuan; Wangbo Yu; Shaodong Wang; Yunyang Ge; Yatian Pang; and Li Yuan

arXiv:2506.03147·cs.CV·June 23, 2025

TL;DR

UniWorld-V1 builds a unified model for visual understanding and generation on high-resolution semantic encoders. It shows strong performance across diverse tasks with limited training data and argues that semantic features matter more than VAE features for image manipulation.

Contribution

The paper presents UniWorld-V1, a unified generative framework built on semantic features from multimodal large language models and contrastive semantic encoders, targeting image perception and manipulation tasks where existing unified models remain limited.

Findings

01

UniWorld-V1 achieves strong performance with only 2.7M training samples.

02

Semantic encoders are more effective than VAEs for image manipulation.

03

Open-source release promotes reproducibility and further research.

Abstract

Although existing unified models achieve strong performance in vision-language understanding and text-to-image generation, they remain limited in addressing image perception and manipulation -- capabilities increasingly demanded in practical applications. Recently, OpenAI introduced the powerful GPT-4o-Image model, which showcases advanced capabilities in comprehensive image perception and manipulation, sparking widespread interest. Through carefully designed experiments, we observe that GPT-4o-Image likely relies on semantic encoders rather than VAEs for feature extraction, despite VAEs being commonly regarded as crucial for image manipulation tasks. Inspired by this insight, we propose UniWorld-V1, a unified generative framework built upon semantic features extracted from powerful multimodal large language models and contrastive semantic encoders. Using only 2.7M training data,…

Code & Models

Datasets

LanguageBind/UniWorld-V1
dataset · 4.3k downloads

Taxonomy

Topics: Multimodal Machine Learning Applications · Generative Adversarial Networks and Image Synthesis · Artificial Intelligence in Healthcare and Education