MedAgentBench: A Realistic Virtual EHR Environment to Benchmark Medical LLM Agents
Yixing Jiang, Kameron C. Black, Gloria Geng, Danny Park, James Zou, Andrew Y. Ng, Jonathan H. Chen

TL;DR
MedAgentBench provides a comprehensive, realistic benchmark environment for evaluating large language model agents in medical record tasks, highlighting current capabilities and areas for improvement.
Contribution
Introduces MedAgentBench, a standardized, interactive, and clinically-derived benchmark suite for assessing LLM agent performance in medical applications.
Findings
The best-performing model, Claude 3.5 Sonnet v2, achieves a 69.67% success rate
Significant variation in performance across task categories
Current models show room for substantial improvement
Abstract
Recent large language models (LLMs) have demonstrated significant advancements, particularly in their ability to serve as agents, thereby surpassing their traditional role as chatbots. These agents can leverage their planning and tool utilization capabilities to address tasks specified at a high level. However, a standardized dataset to benchmark the agent capabilities of LLMs in medical applications is currently lacking, making the evaluation of LLMs on complex tasks in interactive healthcare environments challenging. To address this gap, we introduce MedAgentBench, a broad evaluation suite designed to assess the agent capabilities of large language models within medical records contexts. MedAgentBench encompasses 300 patient-specific clinically-derived tasks from 10 categories written by human physicians, realistic profiles of 100 patients with over 700,000 data elements, a…
Taxonomy
Topics: Radiomics and Machine Learning in Medical Imaging · Biosimilars and Bioanalytical Methods
