Measuring What Matters: Benchmarking Generative, Multimodal, and Agentic AI in Healthcare
Prasanna Desikan, Harshit Rajgarhia, Shivali Dalmia, Ananya Mantravadi

TL;DR
This paper argues that healthcare AI needs comprehensive benchmarks that measure reliability, safety, and clinical relevance under real-world conditions, not just traditional performance metrics.
Contribution
It proposes a systematic framework for benchmarking generative, multimodal, and agentic AI in healthcare to better assess their practical utility and safety.
Findings
Current benchmarks often overestimate clinical readiness.
Performance drops significantly on real clinical tasks.
Existing benchmarks mainly test knowledge, not reliability.
Abstract
AI models are increasingly deployed in live clinical environments where they must perform reliably across complex, high-stakes workflows that standard training and validation datasets were never designed to capture. Evaluating these systems requires benchmarks: structured combinations of tasks, datasets, and metrics that enable reproducible, comparable measurement of what a model can do. The central challenge in healthcare AI is not performance alone, but the absence of systematic methods to measure reliability, safety, and clinical relevance under real-world conditions. Most existing benchmarks test what a model knows; too few test whether it can perform reliably and without failing across the full complexity of real clinical tasks. Current benchmarks have accumulated through ad hoc dataset construction optimized for narrow task performance: frontier models achieve near-perfect scores…
