UniCorn: Towards Self-Improving Unified Multimodal Models through Self-Generated Supervision

Ruiyan Han; Zhen Fang; XinYu Sun; Yuchen Ma; Ziheng Wang; Yu Zeng; Zehui Chen; Lin Chen; Wenxuan Huang; Wei-Jie Xu; Yi Cao; Feng Zhao

arXiv:2601.03193 · cs.CV · January 9, 2026

TL;DR

UniCorn introduces a self-improving framework for unified multimodal models that enhances their generative capabilities through self-play and internal supervision, achieving state-of-the-art results without external data.

Contribution

The paper presents UniCorn, a novel self-supervised method that improves multimodal models' generation by partitioning the model into collaborative roles and using self-generated signals, eliminating external supervision.

Findings

01. Achieves SOTA on multiple image generation benchmarks.

02. Significantly improves text-to-image generation quality.

03. Maintains strong multimodal comprehension while enhancing generation.

Abstract

While Unified Multimodal Models (UMMs) have achieved remarkable success in cross-modal comprehension, a significant gap persists in their ability to leverage such internal knowledge for high-quality generation. We formalize this discrepancy as Conduction Aphasia, a phenomenon where models accurately interpret multimodal inputs but struggle to translate that understanding into faithful and controllable synthesis. To address this, we propose UniCorn, a simple yet elegant self-improvement framework that eliminates the need for external data or teacher supervision. By partitioning a single UMM into three collaborative roles (Proposer, Solver, and Judge), UniCorn generates high-quality interactions via self-play and employs cognitive pattern reconstruction to distill latent understanding into explicit generative signals. To validate the restoration of multimodal coherence, we introduce…
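The three-role self-play described in the abstract can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: `ToyUnifiedModel`, its `propose`/`solve`/`judge` methods, the `self_play` helper, and the score threshold are hypothetical names, and text placeholders stand in for generated images. It is not the paper's implementation and does not cover cognitive pattern reconstruction or any training step.

```python
import random
from dataclasses import dataclass
from typing import List


@dataclass
class Interaction:
    prompt: str
    output: str
    score: float


class ToyUnifiedModel:
    """Hypothetical stand-in for a unified multimodal model (not the UniCorn API)."""

    def __init__(self, seed: int = 0) -> None:
        self.rng = random.Random(seed)

    def propose(self) -> str:
        # Proposer role: the model invents its own generation prompt.
        subject = self.rng.choice(["a red cube", "two cats", "a blue car"])
        return f"an image of {subject}"

    def solve(self, prompt: str) -> str:
        # Solver role: "generate" an output. A text placeholder is used here,
        # and half the time it is deliberately unfaithful to the prompt.
        if self.rng.random() < 0.5:
            return f"<image: {prompt}>"
        return "<image: something unrelated>"

    def judge(self, prompt: str, output: str) -> float:
        # Judge role: score faithfulness with the model's own understanding.
        # A trivial substring check stands in for that judgment.
        return 1.0 if prompt in output else 0.0


def self_play(
    model: ToyUnifiedModel,
    rounds: int,
    candidates_per_prompt: int = 4,
    threshold: float = 0.5,
) -> List[Interaction]:
    """Collect self-generated (prompt, output) pairs that the Judge accepts."""
    kept: List[Interaction] = []
    for _ in range(rounds):
        prompt = model.propose()
        scored = [
            Interaction(prompt, out, model.judge(prompt, out))
            for out in (model.solve(prompt) for _ in range(candidates_per_prompt))
        ]
        best = max(scored, key=lambda it: it.score)
        if best.score >= threshold:  # keep only interactions the Judge rates highly
            kept.append(best)
    return kept


if __name__ == "__main__":
    data = self_play(ToyUnifiedModel(seed=0), rounds=10)
    print(f"kept {len(data)} of 10 self-generated interactions")
```

In this sketch the same object plays all three roles, which mirrors the idea of partitioning one model rather than relying on an external teacher; the kept interactions would then serve as the self-generated supervision signal.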

Code & Models

Models

🤗 CostaliyA/UniCorn (model · 6 downloads · 2 likes)

Taxonomy

Topics: Multimodal Machine Learning Applications · Topic Modeling · Generative Adversarial Networks and Image Synthesis