MMedAgent-RL: Optimizing Multi-Agent Collaboration for Multimodal Medical Reasoning
Peng Xia, Jinglu Wang, Yibo Peng, Kaide Zeng, Zihan Dong, Xian Wu, Xiangru Tang, Hongtu Zhu, Yun Li, Linjun Zhang, Shujie Liu, Yan Lu, Huaxiu Yao

TL;DR
This paper introduces MMedAgent-RL, a reinforcement learning framework for dynamic multi-agent collaboration in multimodal medical reasoning, significantly improving diagnostic accuracy across diverse specialties.
Contribution
It presents a novel RL-based multi-agent system with curriculum learning for flexible, optimized medical reasoning, surpassing existing static collaboration models.
Findings
Achieves a 23.6% average performance improvement over baselines.
Demonstrates effectiveness across five medical VQA benchmarks.
Enables dynamic, adaptive collaboration among medical agents.
Abstract
Medical Large Vision-Language Models (Med-LVLMs) have shown strong potential in multimodal diagnostic tasks. However, existing single-agent models struggle to generalize across diverse medical specialties, limiting their performance. Recent efforts introduce multi-agent collaboration frameworks inspired by clinical workflows, where general practitioners (GPs) and specialists interact in a fixed sequence. Despite improvements, these static pipelines lack flexibility and adaptability in reasoning. To address this, we propose MMedAgent-RL, a reinforcement learning (RL)-based multi-agent framework that enables dynamic, optimized collaboration among medical agents. Specifically, we train two GP agents based on Qwen2.5-VL via RL: the triage doctor learns to assign patients to appropriate specialties, while the attending physician integrates the judgments from multi-specialists and its own…
Peer Reviews
Decision · ICLR 2026 Poster
1. Framing multi-agent medical reasoning as a curriculum RL problem with dynamic entropy control is well-motivated by the reality of imperfect expert judgments. 2. Strong empirical results (a 23.6% average gain over baselines and excellent OOD generalization, 72.6% on MMMU/OmniMedVQA) demonstrate effectiveness. 3. The three-stage curriculum (easy/medium/hard based on specialist accuracy) with corresponding entropy coefficients (0.0001/0.005/0.03) is principled and clearly explained.
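The curriculum in point 3 can be illustrated with a short sketch. It is not the authors' implementation: only the three entropy coefficients come from the review text. The function names and the stage thresholds (all specialists correct is easy, none correct is hard, anything else is medium) are hypothetical choices made for illustration.

```python
# Entropy coefficients per curriculum stage, as quoted in the review.
ENTROPY_COEF = {"easy": 0.0001, "medium": 0.005, "hard": 0.03}


def specialist_accuracy(specialist_answers, gold):
    """Fraction of specialist answers that match the gold answer."""
    if not specialist_answers:
        raise ValueError("need at least one specialist answer")
    correct = sum(answer == gold for answer in specialist_answers)
    return correct / len(specialist_answers)


def assign_stage(accuracy, easy_min=1.0, hard_max=0.0):
    """Map specialist accuracy to a stage.

    ASSUMPTION: the thresholds are hypothetical; the source does not
    state where the easy/medium/hard boundaries lie.
    """
    if accuracy >= easy_min:
        return "easy"
    if accuracy <= hard_max:
        return "hard"
    return "medium"


def entropy_coef_for(specialist_answers, gold):
    """Entropy coefficient to use when training on this sample."""
    stage = assign_stage(specialist_accuracy(specialist_answers, gold))
    return ENTROPY_COEF[stage]
```

Under these assumed thresholds, a sample on which every specialist is wrong gets the largest coefficient. That matches the intuition that the attending physician should explore more when it cannot simply follow the specialists.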
1. Missing critical baselines: No comparison with simpler alternatives that could possibly achieve similar results, e.g., a single GPT-4o or Qwen2.5-VL sampling N diverse outputs using different prompts or high temperatures, followed by majority voting or a trained aggregator. These would test if the complex triage+multi-expert pipeline is necessary. 2. The paper claims the attending physician learns to "correct specialist mistakes," but provides no quantitative evidence on hard cases where all specialists fai
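The majority-voting baseline that the reviewer asks for in point 1 could look roughly like the sketch below. It assumes the N answers have already been sampled from a single model and covers only the aggregation step. The function name, the answer normalization, and the tie-breaking rule are hypothetical and do not come from the paper or the review.

```python
from collections import Counter


def majority_vote(answers):
    """Return the most frequent answer among N sampled outputs.

    ASSUMPTIONS: answers are short strings such as option letters;
    they are normalized by stripping whitespace and upper-casing;
    ties go to the answer that was sampled first.
    """
    if not answers:
        raise ValueError("need at least one sampled answer")
    normalized = [answer.strip().upper() for answer in answers]
    counts = Counter(normalized)
    top_count = max(counts.values())
    # Walk in sampling order so that ties resolve deterministically.
    for answer in normalized:
        if counts[answer] == top_count:
            return answer
```

Comparing the accuracy of this aggregator against the full triage and multi-specialist pipeline on the same questions would be one way to run the ablation the reviewer describes.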
1. The paper is well-written, logically clear, and easy to follow. 2. The theoretical derivations are fairly sound. 3. Extensive experiments demonstrate the superiority of the proposed MMedAgent-RL.
1. The middle part of Figure 1(a) does not reflect the practical workflow of Multi-Agent collaboration; it seems to lack representation of the General Practitioner, which leads to ambiguity. 2. Section 3.1 mentions optimizing the triage doctor using GRPO, so it would be worthwhile to discuss the triage doctor's capability (quantitatively) as well as its reasoning process. 3. The underlying mechanism for the entropy regularization term in Equation 3.1 needs to be explained, and the rationale behi
(1) This framework develops a machine that can adjust collaboration policies based on task difficulty. The integration of C-MARL for entropy control is theoretically motivated. (2) The evaluation and experimental design are comprehensive. The proposed framework shows good performance on five public datasets, including both in-domain and out-of-distribution datasets.
(1) The framework is inspired by the 'triage–specialist–attending' workflow. The authors need to find more evidence to demonstrate that this aligns with the real hospitalization process. Across different sections of a hospital, the workflow may differ. (2) This work lacks the involvement of human experts. (3) Some of the technical details are missing. For example, are the first GP and the second GP updated simultaneously?
Taxonomy
Topics: Semantic Web and Ontologies · Natural Language Processing Techniques · Topic Modeling
