M2IO-R1: An Efficient RL-Enhanced Reasoning Framework for Multimodal Retrieval Augmented Multimodal Generation

Zhiyou Xiao; Qinhan Yu; Binghui Li; Geng Chen; Chong Chen; Wentao Zhang

arXiv:2508.06328 · cs.IR · August 11, 2025

TL;DR

This paper introduces M2IO-R1, an RL-based framework that enables multimodal inputs and outputs for retrieval-augmented generation, improving reasoning, quality, and efficiency in multimodal tasks.

Contribution

The paper presents an RL-enhanced framework that supports multimodal outputs, built around a specialized inserter trained for semantic alignment and efficiency in multimodal generation.

Findings

01. Outperforms baselines in quality and efficiency

02. Achieves strong reasoning with a lightweight 3B model

03. Reduces latency significantly

Abstract

Current research on Multimodal Retrieval-Augmented Generation (MRAG) enables diverse multimodal inputs but remains limited to single-modality outputs, restricting expressive capacity and practical utility. In contrast, real-world applications often demand both multimodal inputs and multimodal outputs for effective communication and grounded reasoning. Motivated by the recent success of Reinforcement Learning (RL) in complex reasoning tasks for Large Language Models (LLMs), we adopt RL as a principled and effective paradigm to address the multi-step, outcome-driven challenges inherent in multimodal output generation. Here, we introduce M2IO-R1, a novel framework for Multimodal Retrieval-Augmented Multimodal Generation (MRAMG) that supports both multimodal inputs and outputs. Central to our framework is an RL-based inserter, Inserter-R1-3B, trained with Group Relative Policy Optimization…
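
The abstract names Group Relative Policy Optimization (GRPO) as the training method for the inserter but is cut off before giving any details. As a rough illustration of the general idea behind GRPO, and not of this paper's specific setup, the sketch below computes group-relative advantages: several responses are sampled for the same prompt, each is scored, and each reward is normalized against the mean and standard deviation of its own group. The function name, the `eps` constant, the use of the population standard deviation, and the example rewards are all assumptions made for illustration.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against the statistics of its own group.

    `rewards` holds one scalar reward per response sampled for the same
    prompt. `eps` (an assumed constant) guards against division by zero
    when every response in the group receives the same reward.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; implementations may differ
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical rewards for four sampled responses to one prompt.
rewards = [0.2, 0.9, 0.5, 0.4]
advantages = group_relative_advantages(rewards)

# Responses scoring above the group mean get a positive advantage,
# those below get a negative one.
for reward, advantage in zip(rewards, advantages):
    print(f"reward={reward:.2f} advantage={advantage:+.3f}")
```

Because the baseline comes from the sampled group itself, this style of normalization needs no separately learned value function, which is one reason it is often paired with small policy models.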

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling · Speech and Dialogue Systems