A Framework for Human Evaluation of Large Language Models in Healthcare Derived from Literature Review

Thomas Yu Chow Tam; Sonish Sivarajkumar; Sumit Kapoor; Alisa V Stolyar; Katelyn Polanska; Karleigh R McCarthy; Hunter Osterhoudt; Xizhi Wu; Shyam Visweswaran; Sunyang Fu; Piyush Mathur; Giovanni E. Cacciamani; Cong Sun; Yifan Peng; Yanshan Wang

arXiv:2405.02559 · cs.CL · September 25, 2024

Open Access

TL;DR

This paper reviews existing methods for human evaluation of large language models in healthcare, highlighting the need for standardization and proposing a comprehensive framework to improve evaluation consistency and reliability.

Contribution

The paper introduces QUEST, a structured framework for human evaluation of LLMs in healthcare, based on an extensive literature review and an analysis of current practices.

Findings

01 Identified diverse evaluation strategies across studies

02 Highlighted the lack of standardized evaluation methods

03 Proposed the QUEST framework for consistent assessment

Abstract

With generative artificial intelligence (AI), particularly large language models (LLMs), continuing to make inroads in healthcare, it is critical to supplement traditional automated evaluations with human evaluations. Understanding and evaluating the output of LLMs is essential to assuring safety, reliability, and effectiveness. However, human evaluation's cumbersome, time-consuming, and non-standardized nature presents significant obstacles to comprehensive evaluation and widespread adoption of LLMs in practice. This study reviews existing literature on human evaluation methodologies for LLMs in healthcare. We highlight a notable need for a standardized and consistent human evaluation approach. Our extensive literature search, adhering to the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) guidelines, includes publications from January 2018 to February 2024.…

Taxonomy

Topics: Artificial Intelligence in Healthcare