OpenAI's HealthBench in Action: Evaluating an LLM-Based Medical Assistant on Realistic Clinical Queries

Sandhanakrishnan Ravichandran; Shivesh Kumar; Rogerio Corga Da Silva; Miguel Romano; Reinhard Berkels; Michiel van der Heijden; Olivier Fail; Valentine Emmanuel Gnanapragasam

arXiv:2509.02594 · q-bio.QM · February 18, 2026 · 2 cites

Open Access

TL;DR

This paper evaluates DR. INFO, an agentic RAG-based clinical support assistant, on HealthBench, a rubric-driven benchmark of open-ended, expert-annotated health conversations, and reports that it outperforms leading frontier LLMs on the Hard subset.

Contribution

It applies HealthBench, OpenAI's rubric-driven, expert-annotated benchmark for assessing LLMs in realistic, open-ended health conversations, to evaluate DR. INFO's performance on clinical queries.

Findings

01

DR. INFO scores 0.68 on the HealthBench Hard subset, outperforming GPT-5 (0.46) and other frontier LLMs.

02

DR. INFO maintains a 0.72 score against similar agents.

03

The evaluation highlights strengths in communication and instruction following, with room for improvement in context awareness.

Abstract

Evaluating large language models (LLMs) on their ability to generate high-quality, accurate, situationally aware answers to clinical questions requires going beyond conventional benchmarks to assess how these systems behave in complex, high-stakes clinical scenarios. Traditional evaluations are often limited to multiple-choice questions that fail to capture essential competencies such as contextual reasoning, contextual awareness, and uncertainty handling. To address these limitations, we evaluate our agentic RAG-based clinical support assistant, DR. INFO, using HealthBench, a rubric-driven benchmark composed of open-ended, expert-annotated health conversations. On the Hard subset of 1,000 challenging examples, DR. INFO achieves a HealthBench Hard score of 0.68, outperforming leading frontier LLMs including the GPT-5 model family (GPT-5: 0.46, GPT-5.2: 0.42, GPT-5.1: 0.40), Grok 3…

Taxonomy

Topics: Artificial Intelligence in Healthcare and Education · Machine Learning in Healthcare · Topic Modeling