From Feedback to Checklists: Grounded Evaluation of AI-Generated Clinical Notes

Karen Zhou; John Giorgi; Pranav Mani; Peng Xu; Davis Liang; Chenhao Tan

arXiv:2507.17717 · cs.CL · October 10, 2025

TL;DR

This paper introduces a structured checklist derived from real user feedback to evaluate AI-generated clinical notes, improving alignment with physician preferences and offering a scalable, interpretable assessment method.

Contribution

It presents a pipeline that distills real user feedback into interpretable checklists for clinical note evaluation. In offline evaluations, the resulting checklist outperforms a baseline approach in coverage, diversity, and predictive power for human ratings.

Findings

01 Checklist outperforms baseline in coverage and diversity

02 Strong alignment with clinician preferences

03 Robust to quality perturbations

Abstract

AI-generated clinical notes are increasingly used in healthcare, but evaluating their quality remains a challenge due to high subjectivity and limited scalability of expert review. Existing automated metrics often fail to align with real-world physician preferences. To address this, we propose a pipeline that systematically distills real user feedback into structured checklists for note evaluation. These checklists are designed to be interpretable, grounded in human feedback, and enforceable by LLM-based evaluators. Using deidentified data from over 21,000 clinical encounters (prepared in accordance with the HIPAA safe harbor standard) from a deployed AI medical scribe system, we show that our feedback-derived checklist outperforms a baseline approach in our offline evaluations in coverage, diversity, and predictive power for human ratings. Extensive experiments confirm the checklist's…
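
To make the idea of a checklist "enforceable by LLM-based evaluators" concrete, here is a minimal Python sketch of how a note might be scored against a list of yes/no checklist items. It is an illustration under stated assumptions, not the authors' implementation: the names `ChecklistItem`, `score_note`, and `toy_keyword_judge` are hypothetical, the example checklist questions are invented for illustration, and the judge is a plain keyword check standing in for whatever LLM evaluator a real system would call.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class ChecklistItem:
    """One yes/no criterion; a 'yes' answer means the note passes (hypothetical schema)."""
    item_id: str
    question: str


# A judge takes (note text, checklist question) and returns True if the note passes.
# In practice this would wrap an LLM call; here it is left pluggable.
Judge = Callable[[str, str], bool]


def score_note(note: str, checklist: List[ChecklistItem], judge: Judge) -> Dict[str, object]:
    """Apply every checklist item to the note and report per-item results plus the pass rate."""
    results = {item.item_id: judge(note, item.question) for item in checklist}
    passed = sum(results.values())
    # An empty checklist yields a score of 0.0 rather than dividing by zero.
    score = passed / len(checklist) if checklist else 0.0
    return {"results": results, "score": score}


def toy_keyword_judge(note: str, question: str) -> bool:
    """Toy stand-in for an LLM evaluator: passes if a keyword tied to the question is in the note."""
    keyword = "medications" if "medications" in question.lower() else "follow-up"
    return keyword in note.lower()


# Invented example items, not taken from the paper's checklist.
example_checklist = [
    ChecklistItem("meds", "Does the note list current medications?"),
    ChecklistItem("plan", "Does the note include a follow-up plan?"),
]

if __name__ == "__main__":
    print(score_note("Current medications: lisinopril.", example_checklist, toy_keyword_judge))
```

Keeping the judge as a plain callable means the per-item results stay inspectable regardless of which evaluator is plugged in, which is the kind of interpretability the abstract attributes to checklist-based evaluation.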

Taxonomy

Topics: Artificial Intelligence in Healthcare and Education