Rethinking Evidence Hierarchies in Medical Language Benchmarks: A Critical Evaluation of HealthBench

Fred Mutisya (1,2); Shikoh Gitau (1); Nasubo Ongoma (1); Keith Mbae (1); Elizabeth Wamicha (1)

arXiv:2508.00081·cs.AI·August 4, 2025

TL;DR

This paper critically evaluates the HealthBench benchmark for medical language models, highlighting its biases and regional disparities, and proposes a new framework grounded in clinical guidelines and systematic evidence to improve global relevance and trustworthiness.

Contribution

It introduces an evidence-based reward framework using clinical practice guidelines and systematic reviews to enhance the clinical validity and fairness of medical language model evaluation.

Findings

01. Identifies biases in current benchmarks that stem from reliance on expert opinion.

02. Proposes a new reward system anchored in clinical guidelines and systematic reviews.

03. Aims to improve global relevance and ethical standards in medical AI evaluation.

Abstract

HealthBench, a benchmark designed to better measure the capabilities of AI systems for health (Arora et al., 2025), has advanced medical language model evaluation through physician-crafted dialogues and transparent rubrics. However, its reliance on expert opinion, rather than high-tier clinical evidence, risks codifying regional biases and individual clinician idiosyncrasies, a risk further compounded by potential biases in automated grading systems. These limitations are magnified in low- and middle-income settings, where sparse coverage of neglected tropical diseases and mismatches with region-specific guidelines are prevalent. The unique challenges of the African context, including data scarcity, inadequate infrastructure, and nascent regulatory frameworks, underscore the urgent need for more globally relevant and equitable benchmarks. To address these shortcomings, we…

Taxonomy

Topics: Artificial Intelligence in Healthcare and Education · Machine Learning in Healthcare · Electronic Health Records Systems