HealthBench Professional: Evaluating Large Language Models on Real Clinician Chats

Rebecca Soskin Hicks; Mikhail Trofimov; Dominick Lim; Rahul K. Arora; Foivos Tsimpourlas; Preston Bowman; Michael Sharman; Chi Tong; Kavin Karthik; Arnav Dugar; Akshay Jagadeesh; Khaled Saab; Johannes Heidecke; Ashley Alexander; Nate Gross; Karan Singhal

arXiv:2604.27470 · cs.CL · May 1, 2026

TL;DR

HealthBench Professional is an open benchmark that evaluates large language models on real clinician tasks (care consults, writing and documentation, and medical research), scoring responses against physician-written rubrics to track progress.

Contribution

It introduces a new, carefully curated benchmark with real-world clinical examples and scoring rubrics to assess LLM performance in healthcare settings.

Findings

01. GPT-5.4 in ChatGPT for Clinicians outperforms other models and human physicians.

02. The benchmark includes difficult and adversarial examples to challenge models.

03. Human physician responses serve as a high-quality baseline.

Abstract

Millions of clinicians use ChatGPT to support clinical care, but evaluations of the most common use cases in model-clinician conversations are limited. We introduce HealthBench Professional, an open benchmark for evaluating large language models on real tasks that clinicians bring to ChatGPT in the course of their work. The benchmark is organized around three common use cases central to clinical practice: care consult, writing and documentation, and medical research. Each example includes a physician-authored conversation with ChatGPT for Clinicians and is scored via rubrics written and iteratively adjudicated by three or more physicians across three phases. HealthBench Professional examples were carefully selected for quality, representativeness, and difficulty for OpenAI's current frontier models, to enable continued measurement of progress. Difficult examples for recent OpenAI models…
