Same Verdict, Different Reasons: LLM-as-a-Judge and Clinician Disagreement on Medical Chatbot Completeness

Alexandra DeLucia; Heyuan Huang; Sonal Joshi; Mahsa Yarmohammadi; Ahmed Hassoon; Mark Dredze

arXiv:2604.16383·cs.CY·April 21, 2026

TL;DR

This study evaluates the reliability of LLM-based judges in assessing medical chatbot responses, revealing significant discrepancies with clinicians and limited utility for triage in high-stakes medical contexts.

Contribution

The paper stress-tests LLM judges for detecting incomplete patient-facing medical responses across three rubric granularities, three backbone models, and two clinician-annotated datasets, and documents where they fall short of clinician evaluation.

Findings

01

LLM judges discriminate complete from incomplete responses at or only slightly above chance (AUC 0.49 to 0.66).

02

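At the threshold needed to recall 90% of incomplete responses, clinicians must still review the vast majority of the dataset, so the judges offer no triage utility.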

03

Discrepancies stem from different standards and detection failures.

Abstract

LLM-as-a-Judge frameworks are increasingly trusted to automate evaluation in place of human experts, yet their reliability in high-stakes medical contexts remains unproven. We stress-test this assumption for detecting incomplete patient-facing medical responses, evaluating three rubric granularities (General-Likert, Analytical-Rubric, Dynamic-Checklist) and three backbone models across two clinician-annotated datasets, including HealthBench, the largest publicly available benchmark for medical response evaluation. LLM judges discriminate complete from incomplete responses at or only slightly above chance (AUC $0.49$–$0.66$); at the threshold required to recall $90\%$ of incomplete responses, clinicians must still review the vast majority of the dataset, offering no triage utility. Even when model and clinician verdicts agree, they rarely cite the same explanation; and when they…
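To make the two metrics in the abstract concrete, the sketch below computes a rank-based AUC and the share of responses a clinician would still have to review at a 90% recall threshold. It is a minimal illustration, not the authors' evaluation code: the function names, the convention that a higher judge score means "more likely incomplete", and the toy scores and labels are all assumptions made for this example.

```python
from math import ceil

def auc(scores, labels):
    """Rank-based AUC: probability that a randomly chosen incomplete response
    (label 1) gets a higher judge score than a randomly chosen complete one
    (label 0). Ties count as half a win."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def review_fraction_at_recall(scores, labels, target_recall=0.9):
    """Fraction of all responses flagged for clinician review at the highest
    score threshold that still catches `target_recall` of incomplete ones."""
    pos = sorted((s for s, y in zip(scores, labels) if y == 1), reverse=True)
    needed = ceil(target_recall * len(pos))   # incomplete responses we must catch
    threshold = pos[needed - 1]               # score of the last one we must catch
    flagged = sum(s >= threshold for s in scores)
    return flagged / len(scores)

# Toy data (hypothetical, not from the paper): 4 incomplete, 6 complete responses.
toy_labels = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
toy_scores = [0.9, 0.6, 0.4, 0.2, 0.8, 0.7, 0.5, 0.3, 0.1, 0.05]

toy_auc = auc(toy_scores, toy_labels)
toy_review = review_fraction_at_recall(toy_scores, toy_labels)
print(f"AUC = {toy_auc:.3f}, reviewed at 90% recall = {toy_review:.0%}")
```

On this toy data the judge scores an AUC of 0.625, and reaching 90% recall means flagging 8 of the 10 responses. That is the pattern the abstract describes: a weakly discriminating judge forces the threshold so low that almost everything is sent to clinicians anyway.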
