MM-R1: Unleashing the Power of Unified Multimodal Large Language Models for Personalized Image Generation

Qian Liang; Yujia Wu; Kuncheng Li; Jiwei Wei; Shiyuan He; Jinyu Guo; Ning Xie

arXiv:2508.11433 · cs.CV · August 27, 2025

TL;DR

This paper introduces MM-R1, a framework that applies a cross-modal Chain-of-Thought (X-CoT) reasoning strategy within unified multimodal large language models to enable scalable personalized image generation, with high subject fidelity and strong text-image alignment, without subject-specific fine-tuning.

Contribution

MM-R1 is the first approach to integrate a cross-modal Chain-of-Thought reasoning strategy for personalized image generation using unified MLLMs, eliminating the need for subject-specific fine-tuning.

Findings

01 Achieves high subject fidelity in generated images.

02 Demonstrates strong text-image alignment in zero-shot settings.

03 Effectively grounds user-provided images and prompts for personalized generation.

Abstract

Multimodal Large Language Models (MLLMs) with unified architectures excel across a wide range of vision-language tasks, yet aligning them with personalized image generation remains a significant challenge. Existing methods for MLLMs are frequently subject-specific, demanding a data-intensive fine-tuning process for every new subject, which limits their scalability. In this paper, we introduce MM-R1, a framework that integrates a cross-modal Chain-of-Thought (X-CoT) reasoning strategy to unlock the inherent potential of unified MLLMs for personalized image generation. Specifically, we structure personalization as an integrated visual reasoning and generation process: (1) grounding subject concepts by interpreting and understanding user-provided images and contextual cues, and (2) generating personalized images conditioned on both the extracted subject representations and user prompts. To…
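
The two-step structure described in the abstract (ground the subject first, then generate conditioned on that grounding and the prompt) can be illustrated with a minimal Python sketch. This is not the authors' implementation: `SubjectGrounding`, `personalize`, `toy_ground`, and `toy_generate` are hypothetical names introduced here, and the two toy functions are string-based stand-ins for what would be calls to a unified MLLM.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass
class SubjectGrounding:
    """Hypothetical container for the result of the grounding step."""
    description: str  # textual reasoning about the subject


def personalize(
    reference_images: Sequence[bytes],
    prompt: str,
    ground: Callable[[Sequence[bytes], str], SubjectGrounding],
    generate: Callable[[SubjectGrounding, str], bytes],
) -> bytes:
    """Run grounding first, then generation conditioned on its result."""
    # Step (1): interpret the user-provided images and contextual cues.
    grounding = ground(reference_images, prompt)
    # Step (2): generate from the extracted subject representation and prompt.
    return generate(grounding, prompt)


# Toy stand-ins (assumptions for illustration only, not real model calls).
def toy_ground(images: Sequence[bytes], prompt: str) -> SubjectGrounding:
    return SubjectGrounding(description=f"subject seen in {len(images)} image(s)")


def toy_generate(grounding: SubjectGrounding, prompt: str) -> bytes:
    return f"[image of {grounding.description}; prompt: {prompt}]".encode()
```

Passing the two stages in as callables keeps the sketch independent of any particular model: the same `personalize` call works for a new subject by supplying different reference images, which mirrors the abstract's point that no per-subject fine-tuning is involved.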
