MedAgentBench: A Realistic Virtual EHR Environment to Benchmark Medical LLM Agents

Yixing Jiang; Kameron C. Black; Gloria Geng; Danny Park; James Zou; Andrew Y. Ng; Jonathan H. Chen

arXiv:2501.14654 · cs.LG · February 13, 2025 · 3 citations

Open Access · 1 Repo · 1 Model · 1 Dataset

TL;DR

MedAgentBench provides a comprehensive, realistic benchmark environment for evaluating large language model agents in medical record tasks, highlighting current capabilities and areas for improvement.

Contribution

Introduces MedAgentBench, a standardized, interactive, and clinically-derived benchmark suite for assessing LLM agent performance in medical applications.

Findings

01. Claude 3.5 Sonnet v2 achieves a 69.67% success rate.

02. Performance varies significantly across task categories.

03. Current models show substantial room for improvement.

Abstract

Recent large language models (LLMs) have demonstrated significant advancements, particularly in their ability to serve as agents, thereby surpassing their traditional role as chatbots. These agents can leverage their planning and tool utilization capabilities to address tasks specified at a high level. However, a standardized dataset to benchmark the agent capabilities of LLMs in medical applications is currently lacking, making the evaluation of LLMs on complex tasks in interactive healthcare environments challenging. To address this gap, we introduce MedAgentBench, a broad evaluation suite designed to assess the agent capabilities of large language models within medical records contexts. MedAgentBench encompasses 300 patient-specific clinically-derived tasks from 10 categories written by human physicians, realistic profiles of 100 patients with over 700,000 data elements, a…

Code & Models

Repositories

stanfordmlgroup/medagentbench
Official

Models

Nadhari/Sara-1.5-4B-it
model · 184 downloads · ♡ 1

Datasets

Nadhari/MedToolCalling
dataset · 20 downloads

Taxonomy

Topics: Radiomics and Machine Learning in Medical Imaging · Biosimilars and Bioanalytical Methods