A Clinical Trial Design Approach to Auditing Language Models in Healthcare Setting
Lovedeep Gondara, Jonathan Simkin

TL;DR
This paper introduces a clinical trial-inspired audit mechanism for healthcare language models that uses principled sample size and power calculations to keep the number of audited records small while preserving statistical rigor, and demonstrates it with a real-world public health example.
Contribution
It proposes a novel audit framework based on clinical trial design principles for assessing healthcare language models, emphasizing sample efficiency and statistical validity.
Findings
Effective sample size calculation for audits
Maintains audit integrity with minimal data
Validated in a large-scale public health setting
Abstract
We present an audit mechanism for language models, with a focus on models deployed in the healthcare setting. Our proposed mechanism takes inspiration from clinical trial design: we frame the language model audit as a single-blind equivalence trial, with subject matter experts as the comparator. We show that our proposed method supports principled sample size and power calculations, so that the audit samples only the minimum number of records needed while maintaining audit integrity and statistical soundness. Finally, we provide a real-world example of the audit used in a production environment in a large-scale public health network.
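To make the sample size idea concrete, here is a minimal sketch of a textbook sample size calculation for an equivalence comparison of two proportions, for example model accuracy versus expert accuracy. It is not taken from the paper and may differ from the authors' exact design. It assumes equal group sizes, a true difference of zero, a normal approximation, and a two one-sided tests (TOST) setup. The function name `equivalence_sample_size` and the example values are hypothetical.

```python
import math
from statistics import NormalDist


def equivalence_sample_size(p_expected, margin, alpha=0.05, power=0.80):
    """Records needed per arm for an equivalence test of two proportions.

    Assumptions (illustrative, not from the paper): both arms share the
    same expected proportion `p_expected`, the true difference is zero,
    and equivalence means the difference lies within +/- `margin`.
    """
    if not 0 < p_expected < 1:
        raise ValueError("p_expected must be strictly between 0 and 1")
    if margin <= 0:
        raise ValueError("margin must be positive")
    if not 0 < alpha < 1 or not 0 < power < 1:
        raise ValueError("alpha and power must be strictly between 0 and 1")

    z = NormalDist().inv_cdf
    z_alpha = z(1 - alpha)            # one-sided level for each TOST test
    z_beta = z(1 - (1 - power) / 2)   # beta is split because the true difference is zero
    variance = 2 * p_expected * (1 - p_expected)  # sum of the two binomial variances
    return math.ceil((z_alpha + z_beta) ** 2 * variance / margin ** 2)


if __name__ == "__main__":
    # Hypothetical audit: 90% expected accuracy, 5-point equivalence margin.
    n = equivalence_sample_size(p_expected=0.90, margin=0.05)
    print(f"Records to sample per arm: {n}")
```

Under this formula the required sample grows with the inverse square of the margin, so halving the margin roughly quadruples the number of records to audit. This is why the equivalence margin has to be fixed before sampling begins.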
Taxonomy
Topics: Biomedical Text Mining and Ontologies · Electronic Health Records Systems · Clinical practice guidelines implementation
