SUDER: Self-Improving Unified Large Multimodal Models for Understanding and Generation with Dual Self-Rewards

Jixiang Hong; Yiran Zhang; Guanzhong Wang; Yi Liu; Ji-Rong Wen; Rui Yan

arXiv:2506.07963 · cs.AI · September 9, 2025

Open Access

TL;DR

SUDER introduces a self-supervised dual reward framework that leverages the inverse relationship between understanding and generation tasks to improve large multimodal models' alignment and performance without external supervision.

Contribution

The paper proposes a novel dual self-reward mechanism exploiting the inverse nature of understanding and generation tasks to enhance multimodal models.

Findings

1. Significant improvement in text-to-image generation quality.

2. Enhanced vision-language alignment without external supervision.

3. Effective self-improvement of multimodal models through dual rewards.

Abstract

Building upon large language models (LLMs), recent large multimodal models (LMMs) unify cross-modal understanding and generation into a single framework. However, LMMs still struggle to achieve accurate vision-language alignment and are prone to generating text responses that contradict the visual input or to failing to follow text-to-image prompts. Current solutions require external supervision (e.g., human feedback or reward models) and address only unidirectional tasks, either understanding or generation. In this work, based on the observation that understanding and generation are naturally inverse dual tasks, we propose SUDER (Self-improving Unified LMMs with Dual sElf-Rewards), a framework reinforcing the understanding and generation capabilities of LMMs with a self-supervised dual reward mechanism. SUDER leverages the inherent duality…

Taxonomy

Topics: Speech and dialogue systems