Learning to Generate via Understanding: Understanding-Driven Intrinsic Rewarding for Unified Multimodal Models
Jiadong Pan, Liang Li, Yuxin Peng, Yu-Ming Tang, Shuohuan Wang, Yu Sun, Hua Wu, Qingming Huang, Haifeng Wang

TL;DR
This paper introduces an intrinsic reward mechanism and a self-supervised reinforcement learning (RL) framework that use a unified multimodal model's own understanding ability to improve its image generation.
Contribution
The paper proposes GvU, a token-level, understanding-driven intrinsic reward, together with a self-supervised RL framework that improves the generation quality of unified multimodal models (UMMs) without external supervision.
Findings
Significant improvement in UMMs' image generation quality.
Enhanced fine-grained visual understanding in UMMs.
Narrowing the capability gap between understanding and generation.
Abstract
Recently, unified multimodal models (UMMs) have made remarkable progress in integrating visual understanding and generation, demonstrating strong potential for complex text-to-image (T2I) tasks. Despite their theoretical promise, a persistent capability gap exists: UMMs typically exhibit superior visual understanding but comparatively weaker generative capabilities. This discrepancy arises largely from the intrinsic decoupling between the understanding and generation processes. While a UMM can accurately interpret fine-grained visual details, it often struggles to produce semantically coherent images from complex textual prompts. To address this challenge, we explore UMMs' internal understanding capability to enhance generation quality. We propose a token-level intrinsic text-image alignment reward mechanism, GvU, enabling the UMM to act simultaneously as teacher and student: it…
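The abstract is cut off before it defines the reward, so the following is only a minimal sketch of what a token-level text-image alignment signal could look like, not the paper's GvU formulation. It assumes, hypothetically, that the model's understanding branch can score each prompt token given a generated image and return a log-probability per token. The function name `token_alignment_rewards` and the example numbers are illustrative assumptions.

```python
import math
from typing import List, Tuple


def token_alignment_rewards(token_logprobs: List[float]) -> Tuple[List[float], float]:
    """Turn per-token log-probabilities into per-token rewards and their mean.

    ASSUMPTION: token_logprobs[i] is the log-probability that the model's own
    understanding branch assigns to prompt token i when conditioned on the
    generated image. This is an illustrative stand-in, not the paper's reward.
    """
    if not token_logprobs:
        raise ValueError("token_logprobs must not be empty")
    # exp(log p) maps each log-probability back to a probability in (0, 1].
    rewards = [math.exp(lp) for lp in token_logprobs]
    # The mean gives one scalar per image, usable to rank candidate images.
    mean_reward = sum(rewards) / len(rewards)
    return rewards, mean_reward


# Hypothetical scores for two candidate images generated from the same prompt.
# Candidate B poorly reflects the second prompt token (low log-probability).
rewards_a, mean_a = token_alignment_rewards([-0.1, -0.2, -0.05])
rewards_b, mean_b = token_alignment_rewards([-0.1, -2.3, -0.05])
preferred = "A" if mean_a > mean_b else "B"
```

Keeping the per-token list alongside the mean is the point of a token-level signal: it shows which part of the prompt an image fails to reflect, rather than only whether the image scores well overall.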
Taxonomy
Topics: Multimodal Machine Learning Applications · Generative Adversarial Networks and Image Synthesis · Topic Modeling
