SUDER: Self-Improving Unified Large Multimodal Models for Understanding and Generation with Dual Self-Rewards
Jixiang Hong, Yiran Zhang, Guanzhong Wang, Yi Liu, Ji-Rong Wen, Rui Yan

TL;DR
SUDER introduces a self-supervised dual reward framework that leverages the inverse relationship between understanding and generation tasks to improve large multimodal models' alignment and performance without external supervision.
Contribution
The paper proposes a novel dual self-reward mechanism exploiting the inverse nature of understanding and generation tasks to enhance multimodal models.
Findings
Significant improvement in text-to-image generation quality.
Enhanced vision-language alignment without external supervision.
Effective self-improvement of multimodal models through dual rewards.
Abstract
Building upon large language models (LLMs), recent large multimodal models (LMMs) unify cross-model understanding and generation into a single framework. However, LMMs still struggle to achieve accurate vision-language alignment, prone to generating text responses contradicting the visual input or failing to follow the text-to-image prompts. Current solutions require external supervision (e.g., human feedback or reward models) and only address unidirectional tasks-either understanding or generation. In this work, based on the observation that understanding and generation are naturally inverse dual tasks, we propose \textbf{SUDER} (\textbf{S}elf-improving \textbf{U}nified LMMs with \textbf{D}ual s\textbf{E}lf-\textbf{R}ewards), a framework reinforcing the understanding and generation capabilities of LMMs with a self-supervised dual reward mechanism. SUDER leverages the inherent duality…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsSpeech and dialogue systems
