The Validity Gap in Health AI Evaluation: A Cross-Sectional Analysis of Benchmark Composition
Alvin Rajkomar, Pavan Sudarshan, Angela Lai, Lily Peng

TL;DR
This study identifies a validity gap in health AI benchmarks: they underrepresent real-world clinical populations, complex data types, and safety-critical scenarios, so aggregate scores may give a misleading picture of model readiness.
Contribution
The paper introduces a standardized taxonomy for profiling health AI benchmark queries and highlights the misalignment between benchmark composition and clinical realities.
Findings
Benchmarks lack complex diagnostic data like lab values and imaging.
Safety-critical scenarios and vulnerable populations are underrepresented.
Clinical composition remains misaligned with real-world healthcare needs.
Abstract
Background: Clinical trials rely on transparent inclusion criteria to ensure generalizability. In contrast, benchmarks validating health-related large language models (LLMs) rarely characterize the "patient" or "query" populations they contain. Without defined composition, aggregate performance metrics may misrepresent model readiness for clinical use. Methods: We analyzed 18,707 consumer health queries across six public benchmarks using LLMs as automated coding instruments to apply a standardized 16-field taxonomy profiling context, topic, and intent. Results: We identified a structural "validity gap." While benchmarks have evolved from static retrieval to interactive dialogue, clinical composition remains misaligned with real-world needs. Although 42% of the corpus referenced objective data, this was polarized toward wellness-focused wearable signals (17.7%); complex diagnostic…
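To make the idea of profiling benchmark composition concrete, here is a minimal sketch of how shares per taxonomy value might be tallied once each query has been coded. It is not the authors' pipeline: the records, the benchmark labels, the field name `objective_data`, and its values are all hypothetical placeholders, and the paper's actual 16-field schema is not reproduced here.

```python
from collections import Counter

# Hypothetical coded records: each dict stands in for one benchmark query
# after a coder has filled in taxonomy fields. The field name
# "objective_data" and its values are illustrative assumptions,
# not the paper's schema, and the data are toy values.
coded_queries = [
    {"benchmark": "A", "objective_data": "wearable"},
    {"benchmark": "A", "objective_data": "none"},
    {"benchmark": "B", "objective_data": "lab_values"},
    {"benchmark": "B", "objective_data": "none"},
    {"benchmark": "B", "objective_data": "wearable"},
]


def composition(records, field):
    """Return the share of records per value of `field`, as fractions of the total."""
    counts = Counter(r[field] for r in records)
    total = sum(counts.values())
    return {value: n / total for value, n in counts.items()}


shares = composition(coded_queries, "objective_data")

# Print values from most to least common.
for value, share in sorted(shares.items(), key=lambda kv: -kv[1]):
    print(f"{value}: {share:.1%}")
```

Reporting shares like these alongside an aggregate score plays a role similar to an inclusion-criteria table in a clinical trial: it shows which kinds of queries the score actually covers.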
