Learning to Generate via Understanding: Understanding-Driven Intrinsic Rewarding for Unified Multimodal Models

Jiadong Pan; Liang Li; Yuxin Peng; Yu-Ming Tang; Shuohuan Wang; Yu Sun; Hua Wu; Qingming Huang; Haifeng Wang

arXiv:2603.06043 · cs.CV · March 9, 2026

Open Access

TL;DR

This paper introduces a novel intrinsic reward mechanism and self-supervised reinforcement learning framework to enhance the generative capabilities of unified multimodal models by leveraging their understanding abilities.

Contribution

It proposes a token-level understanding-driven intrinsic reward (GvU) and a self-supervised RL framework to improve UMMs' generation quality without external supervision.

Findings

01 Significant improvement in UMMs' image generation quality.

02 Enhanced fine-grained visual understanding in UMMs.

03 Narrowing the capability gap between understanding and generation.

Abstract

Recently, unified multimodal models (UMMs) have made remarkable progress in integrating visual understanding and generation, demonstrating strong potential for complex text-to-image (T2I) tasks. Despite their theoretical promise, a persistent capability gap exists: UMMs typically exhibit superior visual understanding but comparatively weaker generative capabilities. This discrepancy arises largely from the intrinsic decoupling between the understanding and generation processes. While a UMM can accurately interpret fine-grained visual details, it often struggles to produce semantically coherent images from complex textual prompts. To address this challenge, we explore UMMs' internal understanding capability to enhance generation quality. We propose a token-level intrinsic text-image alignment reward mechanism, GvU, enabling the UMM to act simultaneously as teacher and student: it…

Taxonomy

Topics: Multimodal Machine Learning Applications · Generative Adversarial Networks and Image Synthesis · Topic Modeling