A Framework for Human Evaluation of Large Language Models in Healthcare Derived from Literature Review
Thomas Yu Chow Tam, Sonish Sivarajkumar, Sumit Kapoor, Alisa V. Stolyar, Katelyn Polanska, Karleigh R McCarthy, Hunter Osterhoudt, Xizhi Wu, Shyam Visweswaran, Sunyang Fu, Piyush Mathur, Giovanni E. Cacciamani, Cong Sun, Yifan Peng, Yanshan Wang

TL;DR
This paper reviews existing methods for human evaluation of large language models in healthcare, highlighting the need for standardization and proposing a comprehensive framework to improve evaluation consistency and reliability.
Contribution
It introduces QUEST, a structured framework for human evaluation of LLMs in healthcare, based on extensive literature review and analysis of current practices.
Findings
Identified diverse evaluation strategies across studies
Highlighted the lack of standardized evaluation methods
Proposed the QUEST framework for consistent assessment
Abstract
With generative artificial intelligence (AI), particularly large language models (LLMs), continuing to make inroads in healthcare, it is critical to supplement traditional automated evaluations with human evaluations. Understanding and evaluating the output of LLMs is essential to assuring safety, reliability, and effectiveness. However, human evaluation's cumbersome, time-consuming, and non-standardized nature presents significant obstacles to comprehensive evaluation and widespread adoption of LLMs in practice. This study reviews existing literature on human evaluation methodologies for LLMs in healthcare. We highlight a notable need for a standardized and consistent human evaluation approach. Our extensive literature search, adhering to the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) guidelines, includes publications from January 2018 to February 2024.…
Taxonomy
Topics: Artificial Intelligence in Healthcare
