Evaluating Multimodal LLMs for Inpatient Diagnosis: Real-World Performance, Safety, and Cost Across Ten Frontier Models

Bruce A. Bassett; Amy Rouillard; Sitwala Mundia; Michael Cameron Gramanie; Linda Camara; Ziyaad Dangor; Shabir A. Madhi; Kajal Morar; Marlvin T. Ncube; Ismail Kalla; Haroon Saloojee

arXiv:2604.16980 · cs.LG · April 21, 2026

TL;DR

This study evaluates the diagnostic accuracy, safety, and cost-effectiveness of ten multimodal large language models (LLMs) on real-world inpatient data from a South African public hospital, highlighting their potential in low- and middle-income country (LMIC) healthcare settings.

Contribution

It provides a comprehensive real-world evaluation of multimodal LLMs in inpatient diagnosis, comparing performance across models and costs in an LMIC context.

Findings

01. All LLMs outperformed routine ward diagnoses on average safety and diagnostic scores.

02. Low-cost models performed comparably to top models despite large cost differences.

03. Adding radiology reports improved diagnostic performance by 6%.

Abstract

Background: Large language models (LLMs) are increasingly proposed for diagnostic support, but few evaluations use real-world multimodal inpatient data, particularly in low- and middle-income country (LMIC) public hospitals. Methods: We conducted VALID, a retrospective evaluation of 539 multimodal inpatient cases from a tertiary public hospital in South Africa. Inputs included radiology imaging (CT, MRI, CXR) and reports, laboratory results, clinical notes, and vital signs. Expert panels adjudicated 300 cases (balanced and discordant subsets) to establish ground-truth diagnoses, differentials, and reasoning. Ten multimodal LLMs generated zero-shot outputs. A calibrated three-model LLM Jury scored all outputs and routine ward diagnoses across diagnostic accuracy, differential quality, reasoning, and patient safety (>10,000 evaluations). Primary outcomes were composite scores ($S_{3}$,…
