End-to-End Evaluation and Governance of an EHR-Embedded AI Agent for Clinicians
Aaryan Shah, Andrew Hines, Alexia Downs, Denis Bajet, Paulius Mui, Fabiano Araujo, Laura Offutt, Aida Rutledge, Elizabeth Jimenez

TL;DR
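This paper introduces an end-to-end governance framework for clinical AI systems and demonstrates it on Hyperscribe, an EHR-embedded agent that converts ambient audio into structured chart updates. It reports higher rubric scores across versions, a shift in live feedback from error reports toward positive observations, and fast, reliable processing.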
Contribution
It presents an end-to-end governance approach combining validation, feedback, monitoring, and experimentation gating for clinical AI deployment.
Findings
Median scores improved from 84% to 95% across seven Hyperscribe versions evaluated in controlled experiments.
Across 107 live feedback entries collected over three months, error reports fell from 79% to 30% of entries, while positive observations grew from an initial 14%.
Processing time per audio segment was 8.1 seconds, with a high completion rate.
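As a rough illustration of how experimentation gating on rubric scores might work, here is a minimal sketch. It assumes each version's evaluation yields a list of per-case percentage scores and that a candidate version is promoted only if its median beats the current baseline by some margin. The function name `passes_gate`, the `min_gain` threshold, and the sample scores are all hypothetical and are not taken from the paper, which does not specify its gating rule here.

```python
from statistics import median


def passes_gate(baseline_scores, candidate_scores, min_gain=0.0):
    """Return True if the candidate's median score exceeds the
    baseline's median by more than min_gain (percentage points).

    Hypothetical gating rule, for illustration only.
    """
    if not baseline_scores or not candidate_scores:
        raise ValueError("both score lists must be non-empty")
    gain = median(candidate_scores) - median(baseline_scores)
    return gain > min_gain


# Made-up per-case rubric scores (percent), not data from the paper.
baseline = [80.0, 84.0, 90.0]
candidate = [92.0, 95.0, 97.0]

# The candidate is released only if the gate opens.
gate_open = passes_gate(baseline, candidate, min_gain=1.0)
```

A real gate would likely also account for sample size and variance rather than comparing medians alone; the sketch only shows where such a check sits between evaluation and deployment.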
Abstract
Clinical AI systems require not just point-in-time evaluation but continuous governance: the ongoing practice of monitoring, evaluating, iterating, and re-evaluating performance throughout deployment. We present an end-to-end framework of governance that integrates rubric validation, live deployment feedback, technical performance monitoring, and cost tracking, with controlled experimentation gating system changes before deployment. Applied to Hyperscribe, an EHR-embedded agent that converts ambient audio into structured chart updates, twenty clinicians authored 1,646 validated rubrics across 823 cases. Seven Hyperscribe versions were evaluated through controlled experiments, with median scores improving from 84% to 95%. Analysis of 107 live feedback entries over three months showed feedback composition shifting from 79% error reports and 14% positive observations to 30% errors and 45%…
