MedAgentBoard: Benchmarking Multi-Agent Collaboration with Conventional Methods for Diverse Medical Tasks

Yinghao Zhu; Ziyi He; Haoran Hu; Xiaochen Zheng; Xichen Zhang; Zixiang Wang; Junyi Gao; Liantao Ma; Lequan Yu

arXiv:2505.12371 · cs.AI · October 31, 2025 · 2 cites

TL;DR

MedAgentBoard is a benchmark that systematically compares multi-agent collaboration, single-LLM, and conventional methods across diverse medical tasks. Its results show that no single approach is best on every task, which should inform the choice of AI solution in healthcare.

Contribution

Introduces MedAgentBoard, a new benchmark for evaluating multi-agent, single-LLM, and conventional approaches on diverse medical tasks with extensive experiments and open resources.

Findings

01 Multi-agent collaboration helps in specific scenarios, such as clinical workflow automation.

02 Single LLMs outperform multi-agent collaboration in medical question answering.

03 Conventional methods often outperform LLM-based approaches in medical VQA and EHR prediction.

Abstract

The rapid advancement of Large Language Models (LLMs) has stimulated interest in multi-agent collaboration for addressing complex medical tasks. However, the practical advantages of multi-agent collaboration approaches remain insufficiently understood. Existing evaluations often lack generalizability, failing to cover diverse tasks reflective of real-world clinical practice, and frequently omit rigorous comparisons against both single-LLM-based and established conventional methods. To address this critical gap, we introduce MedAgentBoard, a comprehensive benchmark for the systematic evaluation of multi-agent collaboration, single-LLM, and conventional approaches. MedAgentBoard encompasses four diverse medical task categories: (1) medical (visual) question answering, (2) lay summary generation, (3) structured Electronic Health Record (EHR) predictive modeling, and (4) clinical workflow…

Code & Models

Repositories

yhzhu99/medagentboard (official)

Videos

MedAgentBoard: Benchmarking Multi-Agent Collaboration with Conventional Methods for Diverse Medical Tasks · SlidesLive

Taxonomy

Topics: Machine Learning in Healthcare · Topic Modeling · Artificial Intelligence in Healthcare and Education