Proactive Reasoning-with-Retrieval Framework for Medical Multimodal Large Language Models
Lehan Wang, Yi Qin, Honglong Yang, Xiaomeng Li

TL;DR
This paper introduces Med-RwR, a multimodal medical reasoning framework that actively retrieves external knowledge during diagnosis, improving accuracy and generalizability in medical large language models.
Contribution
The paper presents the first multimodal reasoning-with-retrieval framework for medical LLMs, integrating visual and textual information with a novel reinforcement learning strategy.
Findings
Significant performance improvements over baselines on medical benchmarks.
8.8% gain on EchoCardiography Benchmark (ECBench).
Effective external knowledge integration enhances reasoning accuracy.
Abstract
Incentivizing the reasoning ability of Multimodal Large Language Models (MLLMs) is essential for medical applications to transparently analyze medical scans and provide reliable diagnosis. However, existing medical MLLMs rely solely on internal knowledge during reasoning, leading to hallucinated reasoning and factual inaccuracies when encountering cases beyond their training scope. Although recent Agentic Retrieval-Augmented Generation (RAG) methods elicit the medical model's proactive retrieval ability during reasoning, they are confined to unimodal LLMs, neglecting the crucial visual information during reasoning and retrieval. Consequently, we propose the first Multimodal Medical Reasoning-with-Retrieval framework, Med-RwR, which actively retrieves external knowledge by querying observed symptoms or domain-specific medical concepts during reasoning. Specifically, we design a two-stage…
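The reasoning-with-retrieval idea described in the abstract can be illustrated with a minimal sketch. Everything in it is hypothetical: the `<query>` tag convention, the `toy_model` and `toy_retriever` stand-ins, and the turn limit are assumptions made for illustration, not the authors' implementation. The actual framework also conditions on the medical image, which this text-only toy omits.

```python
import re
from typing import Callable, List

QUERY_TAG = re.compile(r"<query>(.*?)</query>", re.DOTALL)  # hypothetical tag format


def reason_with_retrieval(
    question: str,
    model_step: Callable[[str], str],
    retrieve: Callable[[str], List[str]],
    max_turns: int = 3,
) -> str:
    """Alternate between model reasoning and retrieval until no query is issued."""
    context = question
    output = ""
    for _ in range(max_turns):
        output = model_step(context)
        match = QUERY_TAG.search(output)
        if match is None:
            return output  # the model answered without asking for more evidence
        passages = retrieve(match.group(1).strip())
        # Append the reasoning so far plus the retrieved text, then let the model continue.
        context += "\n" + output + "\n<retrieved>" + " ".join(passages) + "</retrieved>"
    return output  # turn budget exhausted; return the last model output


# Toy stand-ins (assumptions): a real system would call an MLLM and a search index.
def toy_model(context: str) -> str:
    if "<retrieved>" not in context:
        return "The scan shows a dilated left ventricle. <query>dilated left ventricle causes</query>"
    return "Answer: findings are consistent with dilated cardiomyopathy."


def toy_retriever(query: str) -> List[str]:
    knowledge = {"dilated left ventricle causes": ["Dilated cardiomyopathy enlarges the left ventricle."]}
    return knowledge.get(query, [])


final_answer = reason_with_retrieval("What does this echocardiogram suggest?", toy_model, toy_retriever)
print(final_answer)
```

The loop stops as soon as the model produces output without a query, so a model that already knows the answer pays no retrieval cost; the turn limit only guards against a model that keeps asking.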
Peer Reviews
Decision: Submitted to ICLR 2026
1. The proposed method is solid and supported by comprehensive experimental settings and ablation studies. 2. The target problem it aims to address, agentic retrieval and multimodal information fusion, is important for the medical analysis domain.
Unclear hyperparameter design: the reward function is composite, but the weights assigned to each component vary widely without sufficient explanation or justification. It is unclear how these weights were determined or whether any sensitivity analysis was performed. Inappropriate RAG baselines: although the authors compare their approach with a training-free RAG setup (see Figure 3), they do not clearly specify what the base model is (is it Med-RwR, and is it based on Qwen or Lingshu?). In additi
1. The paper is well written, logically clear, and easy to understand. 2. Extensive experiments on multiple benchmarks demonstrate the superior performance of the proposed Med-RwR. 3. Comprehensive ablation analysis shows the effectiveness of each component.
1. Why does Equation 2 only compute the semantic similarity between the image and the query, instead of also considering the retrieved content as in Equation 1? 2. Section 3.1 mentions that the difficulty levels of the samples were stratified during dataset construction for progressive curriculum training, but curriculum training does not appear to be utilized subsequently in the methodology. 3. In line 309, it is mentioned that “We apply accuracy and format rewards to instill the model’s fundam
- The paper tackles a critical and high-stakes problem in medical AI: the unreliability and hallucination of MLLMs, which stems from their reliance on static internal knowledge. The proposed reasoning-with-retrieval approach is a well-motivated solution. - The reward engineering is a key strength. The Query Semantic Reward is novel for jointly encouraging textual relevance and visual grounding. Furthermore, the Confidence Gain Reward is a nice way to optimize for the utility of the retrieved in
- The main text lacks an Ethics statement and a Reproducibility statement. - The CDIR mechanism, while interesting, has questionable scalability. At inference time, it computes image similarity against a randomly selected subset of 10,000 images from the multimodal corpus. This selection seems arbitrary, and the paper does not address how this method would scale to a more realistic, much larger corpus (e.g., PubMedVision corpus). - The framework's retrieval mechanism is effectively limited to a
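The reviewers' question about reward weights can be made concrete with a small sketch. It assumes the composite reward is a weighted sum of the four components named in the reviews (accuracy, format, query semantic, confidence gain). The functional form, the weight values, and the component scores below are all hypothetical and are not taken from the paper; the sweep simply shows what a one-at-a-time sensitivity check might look like.

```python
from typing import Dict

COMPONENTS = ("accuracy", "format", "query_semantic", "confidence_gain")


def composite_reward(scores: Dict[str, float], weights: Dict[str, float]) -> float:
    """Weighted sum of reward components (assumed form, not the paper's equation)."""
    return sum(weights[name] * scores[name] for name in COMPONENTS)


def weight_sensitivity(scores, weights, name, scales=(0.5, 1.0, 2.0)):
    """Recompute the reward while scaling one weight and holding the others fixed."""
    results = {}
    for scale in scales:
        perturbed = dict(weights)
        perturbed[name] = weights[name] * scale
        results[scale] = composite_reward(scores, perturbed)
    return results


# Hypothetical per-rollout component scores and weights, for illustration only.
scores = {"accuracy": 1.0, "format": 1.0, "query_semantic": 0.5, "confidence_gain": 0.25}
weights = {"accuracy": 1.0, "format": 0.5, "query_semantic": 0.5, "confidence_gain": 2.0}

baseline = composite_reward(scores, weights)
sweep = weight_sensitivity(scores, weights, "confidence_gain")
print(baseline, sweep)
```

A real sensitivity analysis would retrain or re-evaluate the model under each weight setting; this sketch only shows how much the scalar reward for a single rollout moves when one weight changes.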
Taxonomy
Topics: Multimodal Machine Learning Applications · Machine Learning in Healthcare · Topic Modeling
