PhysicianBench: Evaluating LLM Agents in Real-World EHR Environments

Ruoqi Liu, Imran Q. Mohiuddin, Austin J. Schoeffler, Kavita Renduchintala, Ashwin Nayak, Prasantha L. Vemu, Shivam C. Vedak, Kameron C. Black, John L. Havlik, Isaac Ogunmola, Stephen P. Ma, Roopa Dhatt, and Jonathan H. Chen

arXiv:2605.02240·cs.AI·May 5, 2026

TL;DR

PhysicianBench is a benchmark for evaluating large language model (LLM) agents on long-horizon, real-world clinical tasks in electronic health record (EHR) environments. The best agents evaluated succeed on only 46% of tasks.

Contribution

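The paper introduces a benchmark of 100 long-horizon clinical tasks spanning 21 specialties, adapted from real consultation cases between primary care and subspecialty physicians, instantiated in an EHR environment, and independently reviewed by a separate panel of physicians.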

Findings

01

The best LLM agents achieve a success rate of only 46%.

02

Open-source models reach a success rate of at most 19%.

03

The benchmark reveals a significant gap in the clinical capabilities of current AI agents.

Abstract

We introduce PhysicianBench, a benchmark for evaluating LLM agents on physician tasks grounded in real clinical settings within electronic health record (EHR) environments. Existing medical agent benchmarks primarily focus on static knowledge recall, single-step atomic actions, or action intent without verifiable execution against the environment. As a result, they fail to capture the long-horizon, composite workflows that characterize real clinical systems. PhysicianBench comprises 100 long-horizon tasks adapted from real consultation cases between primary care and subspecialty physicians, with each task independently reviewed by a separate panel of physicians. Tasks are instantiated in an EHR environment with real patient records and accessed through the same standard APIs used by commercial EHR vendors. Tasks span 21 specialties (e.g., cardiology, endocrinology, oncology, psychiatry)…
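
The distinction the abstract draws between "action intent" and "verifiable execution against the environment" can be illustrated with a toy sketch. This is not PhysicianBench's actual harness: `ToyEHR`, `Task`, `task_passed`, `success_rate`, and the example task are all hypothetical names invented here for illustration. The sketch only assumes that a task is scored by inspecting the final state of the environment rather than what the agent says it did.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class ToyEHR:
    """Hypothetical in-memory stand-in for an EHR environment (not the real one)."""
    orders: List[Dict[str, str]] = field(default_factory=list)
    notes: List[str] = field(default_factory=list)

    def place_order(self, patient_id: str, item: str) -> None:
        self.orders.append({"patient_id": patient_id, "item": item})

    def write_note(self, text: str) -> None:
        self.notes.append(text)


@dataclass
class Task:
    """A multi-step task: every check must hold on the final environment state."""
    name: str
    checks: List[Callable[[ToyEHR], bool]]


def task_passed(task: Task, env: ToyEHR) -> bool:
    # Score by what actually happened in the environment,
    # not by the agent's stated intent.
    return all(check(env) for check in task.checks)


def success_rate(results: List[bool]) -> float:
    # Fraction of tasks passed; 0.0 for an empty result list.
    return sum(results) / len(results) if results else 0.0


# Hypothetical two-step task: order a lab AND document a recommendation.
demo_task = Task(
    name="toy-endocrinology-consult",
    checks=[
        lambda env: any(o["item"] == "HbA1c" for o in env.orders),
        lambda env: len(env.notes) > 0,
    ],
)

# An agent that only announces its plan leaves the environment unchanged.
intent_only_env = ToyEHR()

# An agent that executes both steps changes the environment state.
executed_env = ToyEHR()
executed_env.place_order("patient-1", "HbA1c")
executed_env.write_note("Recommend repeat HbA1c in 3 months.")

results = [task_passed(demo_task, intent_only_env), task_passed(demo_task, executed_env)]
```

In this sketch a task counts as a success only if every check passes, so a composite workflow that is completed halfway scores the same as one that was never started. Whether the real benchmark scores tasks in this all-or-nothing way is an assumption here.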

Code & Models

Repositories

healthrex/PhysicianBench (GitHub)