Case-Specific Rubrics for Clinical AI Evaluation: Methodology, Validation, and LLM-Clinician Agreement Across 823 Encounters

Aaryan Shah; Andrew Hines; Alexia Downs; Denis Bajet; Paulius Mui; Fabiano Araujo; Laura Offutt; Aida Rutledge; Elizabeth Jimenez

arXiv:2604.24710 · cs.AI · April 28, 2026

TL;DR

This paper introduces a case-specific, clinician-authored rubric methodology for clinical AI evaluation and reports that LLM-generated rubrics can approximate clinician agreement at a fraction of the cost, making assessment scalable while remaining clinically valid.

Contribution

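The paper presents a case-specific, clinician-authored rubric methodology in which each rubric is validated by confirming that an LLM-based scoring agent consistently scores clinician-preferred outputs higher than rejected ones. The approach reduces evaluation cost while preserving clinical relevance and expert judgment.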

Findings

01. Clinician rubrics effectively discriminate between high- and low-quality outputs.

02. LLM-based rankings match or exceed clinician-clinician agreement levels.

03. LLM rubrics enable evaluation coverage at roughly 1,000 times lower cost.

Abstract

Objective. Clinical AI documentation systems require evaluation methodologies that are clinically valid, economically viable, and sensitive to iterative changes. Methods requiring expert review per scoring instance are too slow and expensive for safe, iterative deployment. We present a case-specific, clinician-authored rubric methodology for clinical AI evaluation and examine whether LLM-generated rubrics can approximate clinician agreement.

Materials and Methods. Twenty clinicians authored 1,646 rubrics for 823 clinical cases (736 real-world, 87 synthetic) across primary care, psychiatry, oncology, and behavioral health. Each rubric was validated by confirming that an LLM-based scoring agent consistently scored clinician-preferred outputs higher than rejected ones. Seven versions of an EHR-embedded AI agent for clinicians were evaluated across all cases.

Results. Clinician-authored…
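The rubric validation step described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the names `PreferencePair`, `rubric_is_valid`, and `toy_scorer` are hypothetical, the keyword-matching scorer is only a stand-in for an LLM-based scoring agent, and reading "consistently" as "on every one of `n_trials` repeated scorings" is an assumption.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical interface: a scorer maps (rubric, output) to a numeric score.
# In the paper this role is played by an LLM-based scoring agent.
Scorer = Callable[[str, str], float]


@dataclass
class PreferencePair:
    preferred: str  # output the clinician preferred
    rejected: str   # output the clinician rejected


def rubric_is_valid(rubric: str, pair: PreferencePair, scorer: Scorer,
                    n_trials: int = 5) -> bool:
    """Return True if the preferred output outscores the rejected one
    on every trial. The all-trials criterion is an assumption."""
    if n_trials < 1:
        raise ValueError("n_trials must be at least 1")
    return all(
        scorer(rubric, pair.preferred) > scorer(rubric, pair.rejected)
        for _ in range(n_trials)
    )


def toy_scorer(rubric: str, output: str) -> float:
    """Stand-in scorer: fraction of rubric criteria (one per line)
    that appear verbatim in the output. Not an LLM."""
    criteria = [c.strip().lower() for c in rubric.splitlines() if c.strip()]
    if not criteria:
        return 0.0
    text = output.lower()
    return sum(c in text for c in criteria) / len(criteria)


if __name__ == "__main__":
    rubric = "medication list\nfollow-up plan"
    pair = PreferencePair(
        preferred="Includes medication list and follow-up plan.",
        rejected="Brief note.",
    )
    print(rubric_is_valid(rubric, pair, toy_scorer))
```

Repeated trials only matter when the scorer is stochastic, as an LLM-based agent typically is; with the deterministic stand-in above, every trial returns the same comparison.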
