MMRAG-RFT: Two-stage Reinforcement Fine-tuning for Explainable Multi-modal Retrieval-augmented Generation

Shengwei Zhao; Jingwen Yao; Sitong Wei; Linhai Xu; Yuying Liu; Dong Zhang; Zhiqiang Tian; Shaoyi Du

arXiv:2512.17194·cs.AI·December 22, 2025


TL;DR

This paper introduces a two-stage reinforcement fine-tuning framework for multi-modal retrieval-augmented generation, improving the explainability and reasoning of multi-modal large language models on complex multi-modal tasks.

Contribution

The paper proposes a novel two-stage reinforcement fine-tuning approach that enhances reasoning and explainability in multi-modal retrieval-augmented generation models.

Findings

1. Achieves state-of-the-art results on the WebQA and MultimodalQA datasets.

2. Filters out irrelevant documents through rule-based reinforcement fine-tuning applied to coarse-grained point-wise ranking.

3. Jointly optimizes ranking and answer generation for explainability.

Abstract

Multi-modal Retrieval-Augmented Generation (MMRAG) enables highly credible generation by integrating external multi-modal knowledge, thus demonstrating impressive performance in complex multi-modal scenarios. However, existing MMRAG methods fail to clarify the reasoning logic behind retrieval and response generation, which limits the explainability of the results. To address this gap, we propose to introduce reinforcement learning into multi-modal retrieval-augmented generation, enhancing the reasoning capabilities of multi-modal large language models through a two-stage reinforcement fine-tuning framework to achieve explainable multi-modal retrieval-augmented generation. Specifically, in the first stage, rule-based reinforcement fine-tuning is employed to perform coarse-grained point-wise ranking of multi-modal documents, effectively filtering out those that are significantly…
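The first stage described above (rule-based rewards for point-wise document ranking) can be illustrated with a minimal sketch. This is not the authors' implementation: the `<think>`/`<answer>` output template, the reward weights, and the names `pointwise_reward` and `filter_documents` are all assumptions chosen for illustration, since the abstract does not specify them. The sketch only shows how a rule could score a single relevance judgment and how the judgments could then be used to drop irrelevant documents.

```python
import re

# Hypothetical output template (an assumption, not taken from the paper):
# the model explains its reasoning, then gives a yes/no relevance verdict.
_TEMPLATE = re.compile(
    r"^<think>(.+?)</think>\s*<answer>(yes|no)</answer>$",
    re.DOTALL | re.IGNORECASE,
)


def pointwise_reward(completion: str, is_relevant: bool) -> float:
    """Score one relevance judgment with fixed rules instead of a learned reward model.

    The weights (0.5 for format, 1.0 for accuracy) are illustrative assumptions.
    """
    match = _TEMPLATE.match(completion.strip())
    if match is None:
        # Output that ignores the template earns nothing.
        return 0.0
    format_reward = 0.5
    predicted_relevant = match.group(2).lower() == "yes"
    accuracy_reward = 1.0 if predicted_relevant == is_relevant else 0.0
    return format_reward + accuracy_reward


def filter_documents(doc_ids, completions):
    """Keep only the documents whose well-formed verdict is 'yes'."""
    kept = []
    for doc_id, completion in zip(doc_ids, completions):
        match = _TEMPLATE.match(completion.strip())
        if match is not None and match.group(2).lower() == "yes":
            kept.append(doc_id)
    return kept
```

Because each document is judged independently of the others, a filter of this kind is point-wise: it can discard clearly irrelevant candidates cheaply, but it does not compare the surviving documents with one another.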


Taxonomy

Topics: Topic Modeling · Multimodal Machine Learning Applications · Explainable Artificial Intelligence (XAI)