Case-Specific Rubrics for Clinical AI Evaluation: Methodology, Validation, and LLM-Clinician Agreement Across 823 Encounters
Aaryan Shah, Andrew Hines, Alexia Downs, Denis Bajet, Paulius Mui, Fabiano Araujo, Laura Offutt, Aida Rutledge, Elizabeth Jimenez

TL;DR
This paper introduces a case-specific, clinician-authored rubric methodology for clinical AI evaluation and reports that LLM-generated rubrics can approximate clinician agreement at a fraction of the cost, making clinically valid assessment scalable.
Contribution
The paper presents a clinician-authored, case-specific rubric approach in which each rubric is validated with an LLM-based scoring agent, reducing evaluation cost while preserving clinical relevance and expert judgment.
Findings
Clinician rubrics effectively discriminate between high- and low-quality outputs.
LLM-based rankings match or exceed clinician-clinician agreement levels.
LLM rubrics enable evaluation coverage at roughly 1,000 times lower cost.
Abstract
Objective. Clinical AI documentation systems require evaluation methodologies that are clinically valid, economically viable, and sensitive to iterative changes. Methods requiring expert review per scoring instance are too slow and expensive for safe, iterative deployment. We present a case-specific, clinician-authored rubric methodology for clinical AI evaluation and examine whether LLM-generated rubrics can approximate clinician agreement.
Materials and Methods. Twenty clinicians authored 1,646 rubrics for 823 clinical cases (736 real-world, 87 synthetic) across primary care, psychiatry, oncology, and behavioral health. Each rubric was validated by confirming that an LLM-based scoring agent consistently scored clinician-preferred outputs higher than rejected ones. Seven versions of an EHR-embedded AI agent for clinicians were evaluated across all cases.
Results. Clinician-authored…
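The rubric validation step described under Materials and Methods can be illustrated with a minimal sketch. The abstract does not say how "consistently" is operationalized, so this example assumes that every repeated scoring run on the clinician-preferred output must beat every run on the rejected output. The names PairedScores, rubric_is_valid, and min_margin are hypothetical and do not come from the paper.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class PairedScores:
    """Repeated scorer runs for one rubric (hypothetical structure)."""
    preferred: Sequence[float]  # scores given to the clinician-preferred output
    rejected: Sequence[float]   # scores given to the rejected output


def rubric_is_valid(pair: PairedScores, min_margin: float = 0.0) -> bool:
    """Return True if the worst preferred score still beats the best rejected score.

    Assumption: "consistently higher" means higher on every run, by more
    than min_margin. The paper may use a different criterion.
    """
    return min(pair.preferred) - max(pair.rejected) > min_margin


def validation_rate(pairs: Sequence[PairedScores]) -> float:
    """Fraction of rubrics that pass the check above."""
    if not pairs:
        raise ValueError("no rubrics to validate")
    return sum(rubric_is_valid(p) for p in pairs) / len(pairs)
```

Under this assumed criterion, a rubric whose scores overlap across runs (for example, a preferred run at 0.6 and a rejected run at 0.7) would be flagged for revision rather than accepted.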
