Healthcare LLM Benchmarks Are Only as Good as Their Explicit Assumptions
Naveen Raman, Santiago Cortes-Gomez, Mateo Dulce Rubio, Fei Fang, Bryan Wilder

TL;DR
This paper argues that the limitations of healthcare LLM benchmarks stem from untested implicit assumptions about how users interact with models, and it proposes a framework and tools for surfacing and testing those assumptions.
Contribution
The paper classifies assumptions into two categories, task and outcome, and introduces BenchmarkCards and staged evaluation for testing them systematically.
Findings
A retrospective analysis of a healthcare RCT finds that the evaluation-deployment gap separates into task and outcome gaps of roughly equal size.
BenchmarkCards help document a benchmark's assumptions explicitly.
Staged evaluation systematically tests assumptions and performance.
Abstract
Benchmarks are necessary for healthcare evaluation, but are not sufficient for predicting deployment performance. Our position is that the evaluation-deployment gap arises not because of poorly designed benchmarks, but from implicit assumptions about how users interact with models that cannot be surfaced from benchmarks alone. To make this precise, we propose a classification of assumptions into two categories: task, which can be tested from conversation data alone, and outcome, which requires outcome data and behavioral studies for testing. Critically, outcome assumptions depend on human behavior, something that even well-designed benchmarks cannot directly observe. To demonstrate the operationality of this framework, we retrospectively analyze a healthcare RCT as a case study and find that the gap naturally separates into task and outcome gaps of roughly equal size. To address this,…
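The task/outcome split described in the abstract can be illustrated with a minimal Python sketch. The evidence requirements per category come from the abstract; everything else is an assumption made for illustration and is not taken from the paper, including the class and field names, the `testable_with` helper, and the two example assumptions.

```python
from dataclasses import dataclass
from enum import Enum


class Category(Enum):
    TASK = "task"
    OUTCOME = "outcome"


# Evidence needed to test each category, as stated in the abstract.
REQUIRED_EVIDENCE = {
    Category.TASK: ("conversation data",),
    Category.OUTCOME: ("outcome data", "behavioral studies"),
}


@dataclass(frozen=True)
class Assumption:
    description: str  # hypothetical wording, for illustration only
    category: Category

    def testable_with(self, available: set) -> bool:
        # Testable only if every required kind of evidence is available.
        return set(REQUIRED_EVIDENCE[self.category]) <= available


# Hypothetical example assumptions; not taken from the paper.
assumptions = [
    Assumption("Users describe symptoms the way benchmark prompts do", Category.TASK),
    Assumption("Users act on the model's advice", Category.OUTCOME),
]

# With conversation logs alone, only task assumptions can be checked.
available = {"conversation data"}
untestable = [a.description for a in assumptions if not a.testable_with(available)]
print(untestable)
```

With only conversation data on hand, the outcome assumption is the one left untested, which mirrors the abstract's point that outcome assumptions depend on human behavior that benchmarks cannot directly observe.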
