UniModel: A Visual-Only Framework for Unified Multimodal Understanding and Generation

Chi Zhang; Jiepeng Wang; Youming Wang; Yuanzhi Liang; Xiaoyan Yang; Zuoxin Li; Haibin Huang; Xuelong Li

arXiv:2511.16917 · cs.CV · November 24, 2025

TL;DR

UniModel introduces a unified visual-only diffusion framework that jointly supports multimodal understanding and generation by translating all modalities into a shared pixel space, enabling versatile vision-language tasks.

Contribution

The paper presents a novel pixel-to-pixel diffusion model that unifies multimodal tasks by representing text and images in a shared visual space, eliminating modality discrepancies.

Findings

01. Strong cross-modal alignment demonstrated in experiments.

02. Emergent controllability such as cycle-consistent image-caption-image loops.

03. Effective for both text-to-image synthesis and image-to-text understanding.

Abstract

We present UniModel, a unified generative model that jointly supports visual understanding and visual generation within a single pixel-to-pixel diffusion framework. Our goal is to achieve unification along three axes: the model, the tasks, and the representations. At the representation level, we eliminate modality discrepancies by mapping both text and images into a shared visual space: textual prompts are rendered as painted text images on a clean canvas, and all inputs and outputs are treated purely as RGB pixels. This yields a fully vision-native formulation of multimodal learning. At the task level, a broad range of vision-language problems are cast as pixel-to-pixel transformations in this visual space. For understanding tasks, the model takes an RGB image and produces a painted text image that visually encodes the semantic prediction. For generation tasks, painted text images…
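The "painted text image" representation described in the abstract can be illustrated with a minimal sketch. Everything here is an assumption made for illustration and not taken from the paper: the tiny 3x5 bitmap glyph table, the canvas size, and the function name `render_text_image` are hypothetical, and a real pipeline would presumably use a proper font renderer at a much higher resolution.

```python
# Hypothetical 3x5 bitmap glyphs covering only a few characters (illustration only).
GLYPHS = {
    "A": ["010", "101", "111", "101", "101"],
    "C": ["011", "100", "100", "100", "011"],
    "T": ["111", "010", "010", "010", "010"],
    " ": ["000", "000", "000", "000", "000"],
}

WHITE = (255, 255, 255)  # clean canvas colour
BLACK = (0, 0, 0)        # text colour


def render_text_image(text, width=32, height=9, margin=2):
    """Paint `text` onto a white RGB canvas, returned as rows of (R, G, B) tuples."""
    canvas = [[WHITE for _ in range(width)] for _ in range(height)]
    x = margin
    for ch in text.upper():
        glyph = GLYPHS.get(ch, GLYPHS[" "])  # unknown characters render as blanks
        for dy, row in enumerate(glyph):
            for dx, bit in enumerate(row):
                px, py = x + dx, margin + dy
                # Only paint pixels that fall inside the canvas.
                if bit == "1" and px < width and py < height:
                    canvas[py][px] = BLACK
        x += 4  # 3-pixel glyph plus 1-pixel spacing
    return canvas


# The prompt becomes an ordinary grid of RGB pixels.
prompt_image = render_text_image("a cat")
```

After this step the prompt has the same type as any other image: a grid of RGB values. That is the property the abstract relies on when it treats all inputs and outputs purely as pixels.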

Taxonomy

Topics: Generative Adversarial Networks and Image Synthesis · Multimodal Machine Learning Applications · Historical Architecture and Urbanism