Assessing the Quality of AI-Generated Clinical Notes: A Validated Evaluation of a Large Language Model Scribe
Erin Palm, Astrit Manikantan, Mark E. Pepin, Herprit Mahal, Srikanth Subramanya Belwadi

TL;DR
This study validates a method to evaluate AI-generated clinical notes using a standardized instrument, showing AI notes are nearly comparable in quality to expert-drafted notes across multiple medical specialties.
Contribution
We developed and validated a blinded evaluation framework using PDQI9 to compare AI-generated and human-authored clinical notes, establishing a practical quality assessment approach.
Findings
High inter-rater agreement in note evaluations across specialties
AI notes scored nearly as high as human notes in quality
PDQI9 is effective for assessing AI-generated clinical documentation
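As a rough illustration of the kind of scoring these findings rest on, the sketch below totals PDQI-9 item ratings for a note and measures how often two raters agree. It assumes the original nine PDQI-9 attributes, each rated 1 to 5; the study adapted the instrument, and neither its adapted items nor its agreement statistic are given in this summary, so the function names and the simple exact-match agreement measure are hypothetical stand-ins.

```python
from statistics import mean

# Assumption: the nine attributes of the original PDQI-9, each rated 1-5.
# The study's adapted instrument may differ.
PDQI9_ITEMS = (
    "up_to_date", "accurate", "thorough", "useful", "organized",
    "comprehensible", "succinct", "synthesized", "internally_consistent",
)


def total_score(ratings):
    """Sum the nine item ratings for one note (possible range 9 to 45)."""
    for item in PDQI9_ITEMS:
        if item not in ratings:
            raise ValueError(f"missing rating for {item!r}")
        if not 1 <= ratings[item] <= 5:
            raise ValueError(f"rating for {item!r} must be between 1 and 5")
    return sum(ratings[item] for item in PDQI9_ITEMS)


def mean_total(notes):
    """Average total score over a group of rated notes (e.g. Gold or Ambient)."""
    return mean(total_score(ratings) for ratings in notes)


def exact_agreement(rater_a, rater_b):
    """Fraction of items on which two raters gave the same rating to one note.

    Hypothetical, simplistic measure; not the statistic used in the study.
    """
    matches = sum(rater_a[item] == rater_b[item] for item in PDQI9_ITEMS)
    return matches / len(PDQI9_ITEMS)
```

Comparing `mean_total` for a set of Gold notes against a set of Ambient notes gives a group-level comparison of the kind summarized above; a chance-corrected agreement statistic would be a more rigorous choice than exact matching.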
Abstract
In medical practices across the United States, physicians have begun implementing generative artificial intelligence (AI) tools to perform the function of scribes in order to reduce the burden of documenting clinical encounters. Despite their widespread use, no established methods exist to gauge the quality of AI scribes. To address this gap, we developed a blinded study comparing the relative performance of large language model (LLM)-generated clinical notes with those from field experts based on audio-recorded clinical encounters. Quantitative metrics from the Physician Documentation Quality Instrument (PDQI9) provided a framework to measure note quality, which we adapted to assess relative performance of AI-generated notes. Clinical experts spanning 5 medical specialties used the PDQI9 tool to evaluate specialist-drafted Gold notes and LLM-authored Ambient notes. Two evaluators from…
