HealthBench: Evaluating Large Language Models Towards Improved Human Health

Rahul K. Arora; Jason Wei; Rebecca Soskin Hicks; Preston Bowman; Joaquin Quiñonero-Candela; Foivos Tsimpourlas; Michael Sharman; Meghan Shah; Andrea Vallone; Alex Beutel; Johannes Heidecke; Karan Singhal

arXiv:2505.08775·cs.CL·May 14, 2025

TL;DR

HealthBench is an open-source benchmark that evaluates large language models in healthcare through realistic, multi-turn conversations graded against detailed, physician-written rubrics, tracking both performance and safety.

Contribution

The paper introduces an open-ended healthcare benchmark with 48,562 unique rubric criteria created by 262 physicians, along with multiple variations of the benchmark, enabling more realistic evaluation of language models in health contexts than multiple-choice or short-answer tests.

Findings

01. Steady initial progress: GPT-3.5 Turbo scores 16% and GPT-4o scores 32%.

02. Rapid recent improvement: o3 scores 60%.

03. Smaller models have especially improved: GPT-4.1 nano outperforms larger counterparts.

Abstract

We present HealthBench, an open-source benchmark measuring the performance and safety of large language models in healthcare. HealthBench consists of 5,000 multi-turn conversations between a model and an individual user or healthcare professional. Responses are evaluated using conversation-specific rubrics created by 262 physicians. Unlike previous multiple-choice or short-answer benchmarks, HealthBench enables realistic, open-ended evaluation through 48,562 unique rubric criteria spanning several health contexts (e.g., emergencies, transforming clinical data, global health) and behavioral dimensions (e.g., accuracy, instruction following, communication). HealthBench performance over the last two years reflects steady initial progress (compare GPT-3.5 Turbo's 16% to GPT-4o's 32%) and more rapid recent improvements (o3 scores 60%). Smaller models have especially improved: GPT-4.1 nano…
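The abstract says responses are graded against conversation-specific rubrics but, as excerpted here, does not spell out how criteria combine into a percentage. The following is a minimal sketch under stated assumptions: each criterion carries a point value (negative for undesirable behavior), a grader marks it met or not met, and the score is earned points over the maximum achievable positive points, clipped to [0, 1]. The names `RubricCriterion` and `rubric_score`, the point values, and the example criteria are all hypothetical illustrations, not taken from the paper or its repository.

```python
from dataclasses import dataclass


@dataclass
class RubricCriterion:
    description: str  # what the response should (or should not) do
    points: int       # assumed weighting: positive = desired, negative = undesired
    met: bool         # grader's judgment for this particular response


def rubric_score(criteria: list[RubricCriterion]) -> float:
    """Fraction of achievable positive points earned, clipped to [0, 1]."""
    max_points = sum(c.points for c in criteria if c.points > 0)
    if max_points == 0:
        raise ValueError("rubric needs at least one positively weighted criterion")
    # Met criteria contribute their points, including negative ones (penalties).
    earned = sum(c.points for c in criteria if c.met)
    return min(1.0, max(0.0, earned / max_points))


# Hypothetical rubric for one emergency-style conversation.
example = [
    RubricCriterion("Advises calling emergency services", 10, True),
    RubricCriterion("Asks about symptom onset", 5, False),
    RubricCriterion("Recommends an unsafe medication dose", -8, False),
]
print(rubric_score(example))  # 10 of 15 achievable points
```

Under this assumed scheme, a benchmark-level figure such as the percentages quoted above would be an average of per-conversation scores; the clipping step simply keeps penalties from pushing a score below zero.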

Code & Models

Repositories

openai/simple-evals (official)

Models

MedAIBase/AntAngelMed (Hugging Face model; 130 downloads, 81 likes)

Taxonomy

Methods: Attention Is All You Need · Byte Pair Encoding · Attention Dropout · Softmax · Absolute Position Encodings · Residual Connection · Position-Wise Feed-Forward Layer · Linear Layer