Towards Unified Multimodal Interleaved Generation via Group Relative Policy Optimization

Ming Nie; Chunwei Wang; Jianhua Han; Hang Xu; Li Zhang

arXiv:2603.09538·cs.CV·March 11, 2026

TL;DR

This paper introduces a reinforcement learning-based post-training strategy that enables existing unified vision-language models to generate coherent multimodal interleaved outputs, a capability needed for tasks such as visual storytelling, without relying on large-scale interleaved datasets.

Contribution

The paper proposes a unified policy optimization framework that extends Group Relative Policy Optimization (GRPO) to multimodal generation, using hybrid rewards and process-level guidance to improve interleaved output quality.
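
The group-relative step at the core of GRPO can be illustrated with a minimal sketch. It is not the paper's implementation: the `hybrid_reward` function, its two score inputs, and its weights are hypothetical placeholders, and the use of a population standard deviation is an assumption. The sketch only shows the general idea of scoring several sampled responses to one prompt and normalizing each reward against its own group.

```python
from statistics import mean, pstdev


def hybrid_reward(text_score, image_score, w_text=0.5, w_image=0.5):
    """Hypothetical hybrid reward: a weighted sum of a text score and an image score."""
    return w_text * text_score + w_image * image_score


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against the mean and standard deviation of its group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption of this sketch
    return [(r - mu) / (sigma + eps) for r in rewards]


# One group: four sampled responses to the same prompt, with made-up scores.
scores = [(0.9, 0.7), (0.4, 0.6), (0.8, 0.2), (0.5, 0.5)]
rewards = [hybrid_reward(t, i) for t, i in scores]
advantages = group_relative_advantages(rewards)
```

Responses that score above the group mean receive a positive advantage and the rest a negative one, so no separately learned value function is needed as a baseline.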

Findings

01 Significant improvement in interleaved generation quality and coherence.

02 Effective training with hybrid rewards and process-level guidance.

03 Validated on the MMIE and InterleavedBench benchmarks.

Abstract

Unified vision-language models have made significant progress in multimodal understanding and generation, yet they largely fall short in producing multimodal interleaved outputs, which is a crucial capability for tasks like visual storytelling and step-by-step visual reasoning. In this work, we propose a reinforcement learning-based post-training strategy to unlock this capability in existing unified models, without relying on large-scale multimodal interleaved datasets. We begin with a warm-up stage using a hybrid dataset comprising curated interleaved sequences and limited data for multimodal understanding and text-to-image generation, which exposes the model to interleaved generation patterns while preserving its pretrained capabilities. To further refine interleaved generation, we propose a unified policy optimization framework that extends Group Relative Policy Optimization (GRPO)…

Taxonomy

Topics: Multimodal Machine Learning Applications · Generative Adversarial Networks and Image Synthesis · Domain Adaptation and Few-Shot Learning