Same Verdict, Different Reasons: LLM-as-a-Judge and Clinician Disagreement on Medical Chatbot Completeness
Alexandra DeLucia, Heyuan Huang, Sonal Joshi, Mahsa Yarmohammadi, Ahmed Hassoon, Mark Dredze

TL;DR
This study evaluates the reliability of LLM-based judges in assessing medical chatbot responses, revealing significant discrepancies with clinicians and limited utility for triage in high-stakes medical contexts.
Contribution
It provides a comprehensive stress-test of LLM judges across multiple models and datasets, highlighting their limitations in medical response evaluation.
Findings
LLM judges perform near chance in detecting incomplete responses.
Clinicians still review most responses to ensure completeness.
Discrepancies stem from different standards and detection failures.
Abstract
LLM-as-a-Judge frameworks are increasingly trusted to automate evaluation in place of human experts, yet their reliability in high-stakes medical contexts remains unproven. We stress-test this assumption for detecting incomplete patient-facing medical responses, evaluating three rubric granularities (General-Likert, Analytical-Rubric, Dynamic-Checklist) and three backbone models across two clinician-annotated datasets, including HealthBench, the largest publicly available benchmark for medical response evaluation. LLM Judges discriminate complete from incomplete responses at and slightly above near chance (AUC --); at the threshold required to recall of incomplete responses, clinicians must still review the vast majority of the dataset, offering no triage utility. Even when model and clinician verdicts agree, they rarely cite the same explanation; and when they…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
