The Validity Gap in Health AI Evaluation: A Cross-Sectional Analysis of Benchmark Composition

Alvin Rajkomar; Pavan Sudarshan; Angela Lai; Lily Peng

arXiv:2603.18294·cs.AI·April 17, 2026

TL;DR

This study identifies a gap in health AI benchmarks: they underrepresent real-world clinical populations, complex data types, and safety-critical scenarios, so aggregate scores may misrepresent a model's readiness for clinical use.

Contribution

The paper introduces a standardized taxonomy for profiling health AI benchmark queries and highlights the misalignment between benchmark composition and clinical realities.

Findings

01 Benchmarks lack complex diagnostic data like lab values and imaging.

02 Safety-critical and vulnerable populations are underrepresented.

03 Clinical composition remains misaligned with real-world healthcare needs.

Abstract

Background: Clinical trials rely on transparent inclusion criteria to ensure generalizability. In contrast, benchmarks validating health-related large language models (LLMs) rarely characterize the "patient" or "query" populations they contain. Without defined composition, aggregate performance metrics may misrepresent model readiness for clinical use.

Methods: We analyzed 18,707 consumer health queries across six public benchmarks using LLMs as automated coding instruments to apply a standardized 16-field taxonomy profiling context, topic, and intent.

Results: We identified a structural "validity gap." While benchmarks have evolved from static retrieval to interactive dialogue, clinical composition remains misaligned with real-world needs. Although 42% of the corpus referenced objective data, this was polarized toward wellness-focused wearable signals (17.7%); complex diagnostic…
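To make the idea of benchmark composition concrete, here is a minimal sketch of how shares such as those quoted in the Results could be tallied once each query has been coded. It is not the authors' pipeline: the function name `composition`, the field name `objective_data`, and the category labels are all hypothetical, and the sample records are invented purely for illustration.

```python
from collections import Counter


def composition(coded_queries, field):
    """Return each category's share (in percent) for one taxonomy field.

    `coded_queries` is a list of dicts, one per query, as an automated
    coder might emit them. Field names and labels are assumptions.
    """
    counts = Counter(q.get(field, "unknown") for q in coded_queries)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {label: round(100 * n / total, 1) for label, n in counts.items()}


# Toy records, invented for illustration only (not data from the paper).
sample = [
    {"objective_data": "wearable"},
    {"objective_data": "none"},
    {"objective_data": "none"},
    {"objective_data": "lab_values"},
]

shares = composition(sample, "objective_data")
for label, pct in sorted(shares.items()):
    print(f"{label}: {pct}%")
```

Reporting a table like this per benchmark, alongside the aggregate score, would play a role loosely analogous to the inclusion criteria of a clinical trial.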
