Assessing the Quality of AI-Generated Clinical Notes: A Validated Evaluation of a Large Language Model Scribe

Erin Palm; Astrit Manikantan; Mark E. Pepin; Herprit Mahal; Srikanth Subramanya Belwadi

arXiv:2505.17047·cs.CL·May 26, 2025

TL;DR

This study validates a method to evaluate AI-generated clinical notes using a standardized instrument, showing AI notes are nearly comparable in quality to expert-drafted notes across multiple medical specialties.

Contribution

We developed and validated a blinded evaluation framework using PDQI9 to compare AI-generated and human-authored clinical notes, establishing a practical quality assessment approach.

Findings

01. High inter-rater agreement in note evaluations across specialties

02. AI notes scored nearly as high as human notes in quality

03. PDQI9 is effective for assessing AI-generated clinical documentation

Abstract

In medical practices across the United States, physicians have begun implementing generative artificial intelligence (AI) tools to perform the function of scribes in order to reduce the burden of documenting clinical encounters. Despite their widespread use, no established methods exist to gauge the quality of AI scribes. To address this gap, we developed a blinded study comparing the relative performance of large language model (LLM)-generated clinical notes with those from field experts based on audio-recorded clinical encounters. Quantitative metrics from the Physician Documentation Quality Instrument (PDQI9) provided a framework to measure note quality, which we adapted to assess the relative performance of AI-generated notes. Clinical experts spanning 5 medical specialties used the PDQI9 tool to evaluate specialist-drafted Gold notes and LLM-authored Ambient notes. Two evaluators from…
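As a rough illustration of the paired comparison the abstract describes, the sketch below averages item scores for each note and takes the Gold minus Ambient difference per encounter. It is not the study's analysis code: the 1 to 5 item scale, the nine-item layout, the function names `note_score` and `paired_differences`, and the toy ratings are all assumptions made for illustration.

```python
from statistics import mean


def note_score(item_scores):
    """Mean item score for one note as rated by one evaluator.

    Assumes each item is rated on a 1-5 scale (an assumption of this
    sketch, not a detail taken from the abstract).
    """
    if not item_scores:
        raise ValueError("a note needs at least one item score")
    for s in item_scores:
        if not 1 <= s <= 5:
            raise ValueError(f"item score out of range: {s}")
    return mean(item_scores)


def paired_differences(ratings):
    """Gold minus Ambient mean score for each encounter.

    `ratings` is a list of dicts with hypothetical keys "gold" and
    "ambient", each holding the item scores for the matching note.
    """
    return [note_score(r["gold"]) - note_score(r["ambient"]) for r in ratings]


# Toy ratings, invented for this example only (not results from the paper).
toy_ratings = [
    {"gold": [5, 4, 4, 5, 4, 4, 4, 4, 5], "ambient": [4, 4, 4, 5, 4, 4, 5, 4, 4]},
    {"gold": [4, 4, 4, 4, 4, 4, 4, 4, 4], "ambient": [4, 4, 5, 4, 4, 4, 4, 4, 4]},
]

diffs = paired_differences(toy_ratings)
mean_diff = mean(diffs)
print(f"per-encounter differences: {diffs}")
print(f"mean Gold - Ambient difference: {mean_diff:.3f}")
```

Keeping the comparison paired by encounter means each difference reflects two notes written from the same audio recording, so variation between encounters does not get mixed into the Gold versus Ambient gap.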
