Can LLMs Score Medical Diagnoses and Clinical Reasoning as well as Expert Panels?

Amy Rouillard; Sitwala Mundia; Linda Camara; Michael Cameron Gramanie; Ziyaad Dangor; Ismail Kalla; Shabir A. Madhi; Kajal Morar; Marlvin T. Ncube; Haroon Saloojee; Bruce A. Bassett

arXiv:2604.14892 · cs.LG · April 20, 2026

TL;DR

This study evaluates whether a calibrated, multi-model LLM jury can reliably replicate expert clinician panel assessments in medical diagnosis, showing promise for efficient AI benchmarking.

Contribution

It demonstrates that a calibrated LLM jury can match expert panel evaluations, reducing costs and improving efficiency in medical AI assessment.

Findings

01 LLM jury scores are systematically lower than clinician scores

02 LLM jury shows better agreement with primary expert panels than re-scoring panels

03 Calibration improves alignment with human evaluations

Abstract

Evaluating medical AI systems using expert clinician panels is costly and slow, motivating the use of large language models (LLMs) as alternative adjudicators. Here, we evaluate an LLM jury composed of three frontier AI models scoring 3333 diagnoses on 300 real-world middle-income country (MIC) hospital cases. Model performance was benchmarked against evaluations from an expert clinician panel and from an independent human re-scoring panel. Both LLM-generated and clinician-generated diagnoses are scored across four dimensions: diagnosis, differential diagnosis, clinical reasoning and negative treatment risk. For each of these, we assess scoring difference, inter-rater agreement, scoring stability, severe safety errors and the effect of post-hoc calibration. We find that: (i) the uncalibrated LLM jury scores are systematically lower than clinician panel scores; (ii) the LLM jury preserves ordinal agreement and…
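
To make the quantities named in the abstract concrete, the following is a minimal sketch of jury aggregation, the signed scoring difference, and a post-hoc calibration step. It is not the authors' method: the median aggregation rule, the constant-offset calibration, the 0 to 5 score range, and all function names and data are assumptions chosen for illustration only.

```python
from statistics import mean, median


def jury_score(model_scores):
    """Aggregate one diagnosis's scores from several models.

    Using the median is an assumption; the listing does not say how
    the three models' scores are combined.
    """
    return median(model_scores)


def mean_difference(jury, clinicians):
    """Mean signed difference (jury minus clinician).

    A negative value means the jury scores lower on average.
    """
    return mean(j - c for j, c in zip(jury, clinicians))


def fit_offset(jury, clinicians):
    """Fit a constant shift that removes the mean scoring difference.

    A constant offset is a hypothetical, deliberately simple stand-in
    for post-hoc calibration.
    """
    return -mean_difference(jury, clinicians)


def calibrate(jury, offset, lo=0.0, hi=5.0):
    """Apply the shift and clip to an assumed 0-5 score range."""
    return [min(hi, max(lo, j + offset)) for j in jury]


# Hypothetical data: three model scores per diagnosis, one panel score each.
raw_model_scores = [[3, 4, 3], [2, 2, 3], [4, 5, 4], [1, 2, 2]]
clinician_scores = [4, 3, 5, 2]

jury = [jury_score(s) for s in raw_model_scores]
offset = fit_offset(jury, clinician_scores)
calibrated = calibrate(jury, offset)

print("mean difference before:", mean_difference(jury, clinician_scores))
print("mean difference after: ", mean_difference(calibrated, clinician_scores))
```

A constant shift leaves the ranking of diagnoses unchanged (apart from clipping at the range limits), so it can reduce a systematic bias without affecting ordinal agreement. In practice the offset would be fitted on one subset of cases and checked on held-out cases rather than on the same data, as done here for brevity.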
