TL;DR
MM-R5 is a multimodal reranking model that uses reinforcement learning and reasoning chains to improve document retrieval accuracy across multiple domains, achieving state-of-the-art results.
Contribution
The paper introduces MM-R5, a novel multimodal reranker trained with a two-stage process including reasoning-focused supervised fine-tuning and reinforcement learning, enhancing retrieval precision.
Findings
Achieves state-of-the-art performance on the MMDocIR benchmark.
Improves recall@1 by over 4% compared to previous methods.
Effectively utilizes reasoning chains and reinforcement learning for multimodal reranking.
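To make the recall@1 finding concrete, here is a minimal sketch of how a reranker reorders retrieved candidates and how recall@k is then computed. It is not MM-R5 or the MMDocIR evaluation code: the function names, the page identifiers, and the toy scores are all hypothetical, and the recall@k definition used (relevant items in the top k divided by all relevant items) is a common one that may differ in detail from the benchmark's.

```python
from typing import Callable, Sequence


def rerank(candidates: Sequence[str], score: Callable[[str], float]) -> list[str]:
    """Reorder retrieved candidates by descending reranker score."""
    return sorted(candidates, key=score, reverse=True)


def recall_at_k(ranked: Sequence[str], relevant: set[str], k: int) -> float:
    """Fraction of the relevant items that appear in the top-k positions."""
    if not relevant:
        return 0.0
    hits = sum(1 for item in ranked[:k] if item in relevant)
    return hits / len(relevant)


# Hypothetical first-stage retrieval order for one query.
retrieved = ["page_3", "page_7", "page_1"]
# Hypothetical reranker scores (stand-ins for a real model's output).
toy_scores = {"page_3": 0.2, "page_7": 0.9, "page_1": 0.5}
# Hypothetical ground truth: the single page that answers the query.
relevant = {"page_7"}

before = recall_at_k(retrieved, relevant, k=1)
reranked = rerank(retrieved, toy_scores.__getitem__)
after = recall_at_k(reranked, relevant, k=1)

print(f"recall@1 before reranking: {before}")
print(f"recall@1 after reranking:  {after}")
```

In this toy case the relevant page moves from second to first place, so recall@1 for the query goes from 0 to 1. A benchmark figure is the average of this per-query value over all queries.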
Abstract
Multimodal document retrieval systems enable information access across text, images, and layouts, benefiting various domains like document-based question answering, report analysis, and interactive content summarization. Rerankers improve retrieval precision by reordering retrieved candidates. However, current multimodal reranking methods remain underexplored, with significant room for improvement in both training strategies and overall effectiveness. Moreover, the lack of explicit reasoning makes it difficult to analyze and optimize these methods further. In this paper, we propose MM-R5, a MultiModal Reasoning-Enhanced ReRanker via Reinforcement Learning for Document Retrieval, aiming to provide a more effective and reliable solution for multimodal reranking tasks. MM-R5 is trained in two stages: supervised fine-tuning (SFT) and reinforcement learning (RL). In the SFT stage, we focus…
